Infrared Small Target Detection Based on Tensor Tree Decomposition and Self-Adaptive Local Prior

Abstract: Infrared small target detection plays a crucial role in both military and civilian systems. However, current detection methods face significant challenges in complex scenes, such as inaccurate background estimation, inability to distinguish targets from similar non-target points, and poor robustness across various scenes. To address these issues, this study presents a novel spatial-temporal tensor model for infrared small target detection. In our method, we introduce the tensor tree rank to capture global structure through a more balanced strategy, which helps achieve more accurate background estimation. Meanwhile, we design a novel self-adaptive local prior weight by evaluating the level of clutter and noise content in the image, which mitigates the imbalance between target enhancement and background suppression. Then, the spatial-temporal total variation (STTV) is used as a joint regularization term to better remove noise and improve detection performance. Finally, the proposed model is efficiently solved by the alternating direction method of multipliers (ADMM). Extensive experiments demonstrate that our method achieves superior detection performance when compared with other state-of-the-art methods in terms of target enhancement, background suppression, and robustness across various complex scenes. Furthermore, we conduct an ablation study to validate the effectiveness of each module in the proposed model.


Introduction
In comparison to active radar imaging, infrared imaging offers the benefits of enhanced portability and improved concealment. Meanwhile, compared with visible light systems, it offers advantages such as strong anti-interference capability and the ability to operate around the clock [1]. Owing to these benefits, infrared dim and small target detection plays a significant role in military and civil applications, such as aerospace technology [2], security surveillance [3], and forest fire prevention [4]. However, due to the long detection distance, infrared targets usually occupy only a few pixels and lack shape information and textural features. In addition, infrared images in complex scenes often contain a variety of interferences (e.g., heavy clutter and prominent suspicious targets), resulting in a low signal-to-clutter ratio (SCR) [5]. Therefore, infrared small target detection remains a challenging issue and has attracted widespread research interest.

Related Works
In general, infrared small target detection algorithms primarily include single-frame detection and sequential-frame detection [6]. For a long time, many single-frame detection approaches have been developed to address the challenges in infrared small target detection. Single-frame detection methods can be divided into four categories: (1) background consistency-based methods; (2) human visual system (HVS)-based methods; (3) deep learning (DL)-based methods; and (4) low-rank and sparse decomposition (LRSD)-based methods.

• Background consistency-based methods achieve target enhancement and background suppression based on the assumption of background consistency. Typical methods include the Top-hat filter [7], Max-Mean and Max-Median filters [8], and the high-pass filter [9]. Hadhoud and Thomas [10] extended the LMS algorithm [11] and proposed a two-dimensional adaptive least mean square filter (TDLMS). Cao and Sun [12] utilized the maximum inter-class variance method to improve morphological filtering. Although these methods are capable of achieving fast detection speeds, they are unsuitable for application in complex scenes.

• Contrast is the most crucial factor encoded in our visual system; HVS-based methods generally utilize visual saliency features to distinguish the target from the background. Chen et al. [13] proposed a local contrast method (LCM) to describe the difference between the target and its neighborhood. Inspired by LCM, many methods based on local contrast improvement have been proposed. Starting from the perspective of image patch difference, Wei et al. [14] presented a multiscale patch-based contrast measure (MPCM). Shi et al. [15] proposed a high-boost-based multiscale local contrast measure (HBMLCM). Han et al. [16] designed a multiscale tri-layer local contrast measure (TLLCM) to compute comprehensive contrast. Han et al. [17] improved the detection accuracy by utilizing the Laplacian filter and proposed a coarse-to-fine structure (MCFS) for infrared small moving target detection. However, when the image contains background edges and pixel-sized noises with high brightness (PNHB), such algorithms usually produce high false alarm rates.

• With the development of artificial neural networks, DL-based methods have received extensive attention for their application in infrared target detection. Fan et al. [18] improved the convolutional neural network to extract infrared image features, aiming to improve detection accuracy and efficiency. Zhao et al. [19] designed a generative adversarial network (GAN) architecture that models the detection problem as an image-to-image translation problem. In [20], a novel Dim2Clear network (Dim2Clear) was proposed to solve the problem of noise interference. Recently, Ying et al. [21] developed a label evolution framework with single point supervision. Although DL-based methods can achieve good detection performance on scenes similar to their training data, their generalization to practical applications remains a challenge.

• In recent years, LRSD-based methods have achieved great success and can now effectively separate the low-rank background and the sparse target of an infrared image. Gao et al. [22] first proposed an infrared patch-image model (IPI) by constructing local patches. Consequently, infrared small target detection is transformed into an optimization problem. However, as the nuclear norm minimization (NNM) uses the same threshold to shrink singular values, an over-shrinkage problem may occur in complex backgrounds full of interference [23]. Furthermore, besides the target, edges and corners in the background are also considered as sparse components under the $\ell_1$-norm [24]. To handle the above problems, Dai et al. [25] constructed a non-negative infrared patch-image model (NIPPS) by adding a non-negative constraint to the sparse target. Wang et al. [26] introduced the total variation regularization that better removes the noise and proposed a total variation regularization and principal component pursuit model (TV-PCP). Zhang et al. [27] designed a nonconvex rank approximation minimization (NRAM) by utilizing the $\ell_{2,1}$-norm to constrain the remaining edges. Assuming that the background comes from multiple subspaces, the stable multi-subspace learning model (SMSL) [28] and the self-regularized weighted sparse model (SRWS) [29] were proposed to improve detection performance. In order to better extract the image structure information and meet the practical demand for fast detection speed, Dai and Wu [30] adopted the tensor structure and proposed a reweighted infrared patch-tensor model (RIPT). Zhang and Peng [31] combined the partial sum of the tensor nuclear norm (PSTNN) and the local prior to effectively improve detection efficiency. In [32], the tensor fibered nuclear norm based on the Log operator (LogTFNN) was used as a nonconvex approximation of the tensor rank, which helps suppress background and noise. Zhang et al. [33] constructed a non-local block tensor and an adaptive compromising factor based on the image local entropy. Then, a self-adaptive and non-local patch-tensor model (ANLPT) was proposed for infrared small target detection.
Although the above LRSD-based single-frame detection methods have achieved good results, they ignore temporal information. Traditional sequential-frame detection methods, such as 3D matched filtering [34], dynamic programming algorithms [35], the spatiotemporal saliency approach [36], and trajectory consistency [37], face challenges in effectively separating the background from the target. In addition, these methods usually require prior knowledge, which is difficult to obtain in practical applications. In order to exploit the spatial-temporal information that is neglected in LRSD-based single-frame detection approaches, Sun et al. [38] stacked images from successive adjacent frames. Inspired by this, Zhang et al. [39] proposed a novel spatial-temporal tensor model with edge-corner awareness to further improve detection ability. Considering that the Laplace operator can approximate the tensor rank more accurately, Hu et al. [40] proposed a multiframe spatial-temporal patch-tensor model (MFSTPT). Wang et al. [41] integrated the nonoverlapping patch spatial-temporal tensor model (NPSTT) and the tensor capped nuclear norm (TCNN) for detection results with low false alarms. Further, Liu et al. [42] designed a nonconvex tensor Tucker decomposition method, in which a factor prior was used to obtain accurate background estimation and reduce computational complexity.

Motivation
Compared with background consistency-based approaches and HVS-based approaches, low-rank and sparse tensor decomposition (LRSTD)-based methods can better enhance small target features and suppress background clutter interference. Among these approaches, single-frame detection methods only consider a single-frame image to construct the optimization model and struggle to achieve accurate results in various challenging environments with dynamic change or heavy clutter. Considering the significance of combining contextual information in the spatial-temporal domain, this article primarily concentrates on sequential-frame infrared target detection. While currently available methods have achieved relatively good detection performance, there are still some issues that need to be addressed.
First, due to the complex multilinear structure of the tensor, accurately approximating the background tensor rank is always a major difficulty. To improve the accuracy of background estimation, LRSTD-based methods [30,31,43] focus on designing more accurate tensor rank constraints, such as the sum of nuclear norm (SNN), tensor nuclear norm (TNN), and tensor train nuclear norm. Nevertheless, it has been proven that SNN fails to accurately approximate the tensor rank [44]. By its definition, TNN lacks flexibility and the ability to measure low-rankness from multiple modes [45]. Although tensor train rank has a well-balanced matricization scheme, it suffers from higher storage requirements [46]. In summary, the approximation of the background tensor rank still needs to be improved. Therefore, we apply tensor tree rank to separate target and background. Compared with the above strategies, tensor tree decomposition is a more balanced method that splits the modes of a tensor in a hierarchical way.
In addition to accurate background tensor estimation, the suppression of strong edges and corner points is key to achieving good detection performance. The local structure prior is often used to suppress interference. The RIPT only focuses on the edge structure information of the background, which may lead to false alarms. Likewise, the fixed prior weights used in PSTNN and MFSTPT cannot effectively suppress clutter in diverse scenes with different levels of interference. It is important to balance the enhancement of the target and the suppression of the interference from edges and corners in different scenes. To solve this problem, we propose a self-adaptive local prior method to adaptively suppress background clutter. Moreover, we use spatial-temporal total variation (STTV) to explore local smooth information. This strategy helps us to better remove the background noise. By combining tensor tree decomposition, self-adaptive local prior, and STTV, our method can accurately detect small targets. In the following sections, we refer to the proposed method as the TTALP-TV method. We present the results of qualitative and quantitative experiments to demonstrate that TTALP-TV surpasses other state-of-the-art methods in terms of target enhancement and background suppression in various complex scenes. Figure 1 presents the flowchart of our method. The main contributions of this article can be summarized as follows:
(1) In order to approximate the tensor rank function more flexibly and accurately, we introduce tensor tree decomposition to exploit spatial and temporal correlation through a hierarchical structure.
(2) The self-adaptive local prior is proposed as a target weight, which can not only better extract target information but also more effectively remove background clutter. Simultaneously, we impose an STTV regularization constraint on the background to preserve image details and reduce noise interference.
(3) We integrate the tensor tree rank, self-adaptive local prior, and STTV for infrared small target detection. An efficient optimization scheme using the alternating direction method of multipliers (ADMM) is introduced to solve the proposed model.
The remaining sections of this article are organized as follows. Section 2 summarizes the notations and preliminaries of tensor tree decomposition. Section 3 introduces the TTALP-TV model and describes its optimization procedure in detail. In Section 4, we demonstrate the effectiveness of the proposed algorithm through extensive experiments and analyses. Finally, Section 5 concludes this article and discusses future work.
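The specific explanations of the symbols used are given in Table 1.

Table 1. Notations.
Notation | Explanation
$\|\mathcal{X}\|_0$ | The $\ell_0$-norm, the number of non-zero elements in $\mathcal{X}$
$\|\mathcal{X}\|_1$ | The $\ell_1$-norm, the sum of the absolute values of all elements in $\mathcal{X}$
$\|\mathcal{X}\|_F$ | The Frobenius norm, the square root of the sum of the squares of all elements in $\mathcal{X}$
$\|X\|_*$ | The matrix nuclear norm, the sum of all singular values of $X$

Tensor Tree Network
Definition 1 (Tensor tree structure) [47]. For $d$th-order data, we define a binary tree $\mathbb{T}$ with root $D = \{1, \cdots, d\}$ as its dimension tree. Each node $C_q \subset D$, $q = 1, \cdots, Q$, possesses the following attributes:
1. The node with only one entry is a leaf, i.e., $|C_q| = 1$. The set of all leaf nodes can be represented as follows:
$$\mathcal{L}(\mathbb{T}) = \{C_q \in \mathbb{T} : |C_q| = 1\}, \quad |\mathcal{L}(\mathbb{T})| = P, \tag{1}$$
where $P$ is the number of leaves.
2. The node consisting of two disjoint successors is an interior node. The set of all interior nodes is denoted by:
$$\mathcal{I}(\mathbb{T}) = \mathbb{T} \setminus \mathcal{L}(\mathbb{T}), \tag{2}$$
and $Q - P$ represents the number of interior nodes.
3. The tree distance $h(C_q)$ is the distance between the node $C_q \subset D$ and the root; its maximum over all nodes is the depth of the tree. At depth $h$, $P_h$ and $Q_h$ denote the number of leaves and total nodes, respectively.

Definition 2 (Matricization) [47]. Given a node of dimension indices $C_q \subset D$ and its complement $\bar{C}_q = D \setminus C_q$, the matricization of a tensor $\mathcal{X} \in \mathbb{R}^{n_1 \times \cdots \times n_d}$ is defined as:
$$X^{[q]} \in \mathbb{R}^{n_{C_q} \times n_{\bar{C}_q}}, \tag{3}$$
where $n_{C_q} = \prod_{c \in C_q} n_c$ and $n_{\bar{C}_q} = \prod_{c \in \bar{C}_q} n_c$.

Definition 3 (Tensor tree rank) [47]. Let $\mathbb{T}$ be a dimension tree of a $d$th-order tensor; the tensor tree rank is the set of ranks of the matricization for each node, in the form of:
$$\operatorname{rank}_{\text{tree}}(\mathcal{X}) = (r_1, \cdots, r_Q), \quad r_q = \operatorname{rank}\left(X^{[q]}\right). \tag{4}$$

Definition 4 (Tensor tree decomposition) [47]. Given $\mathcal{X} \in \mathbb{R}^{n_1 \times \cdots \times n_d}$, for every node $C_q \subset D$, $X^{[q]}$ can be written as:
$$X^{[q]} = U_q V_q^{\mathrm{T}}, \quad U_q \in \mathbb{R}^{n_{C_q} \times r_q}, \tag{5}$$
where $r_q$ is the standard matrix rank of $X^{[q]}$. For each interior node $C_q \in \mathcal{I}(\mathbb{T})$ with two disjoint successors $C_{q_1}$ and $C_{q_2}$, the column vectors $U_q(:, l)$ of $U_q$ can be expressed as:
$$U_q(:, l) = \sum_{l_1 = 1}^{r_{q_1}} \sum_{l_2 = 1}^{r_{q_2}} B_q(l, l_1, l_2)\, U_{q_1}(:, l_1) \otimes U_{q_2}(:, l_2), \tag{6}$$
where $B_q(l, l_1, l_2)$ is the coefficient of the linear combination. Figure 2 graphically illustrates the tensor tree decomposition of a 4th-order tensor, providing an intuitive understanding of its structure.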
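To make Definitions 2 and 3 concrete, the following is a minimal NumPy sketch that enumerates the nodes of a balanced binary dimension tree and computes the matrix rank of each node's matricization. The helper names (`matricize`, `balanced_tree`, `tensor_tree_rank`), the zero-based mode indexing, and the choice of a balanced split are illustrative assumptions, not the implementation used in [47].

```python
import numpy as np

def matricize(x, node):
    """Unfold x so the modes in `node` index the rows and the remaining modes the columns."""
    node = list(node)
    rest = [m for m in range(x.ndim) if m not in node]
    rows = int(np.prod([x.shape[m] for m in node]))
    return np.transpose(x, node + rest).reshape(rows, -1)

def balanced_tree(modes):
    """List every node of a balanced binary dimension tree over `modes`, root first."""
    modes = tuple(modes)
    if len(modes) == 1:
        return [modes]                       # a leaf holds a single mode
    half = (len(modes) + 1) // 2             # assumed balanced split
    return [modes] + balanced_tree(modes[:half]) + balanced_tree(modes[half:])

def tensor_tree_rank(x, tol=1e-8):
    """Map each tree node to the matrix rank of its matricization."""
    return {node: int(np.linalg.matrix_rank(matricize(x, node), tol=tol))
            for node in balanced_tree(range(x.ndim))}

# A rank-one 4th-order tensor: every matricization has rank 1.
rng = np.random.default_rng(0)
a, b, c, d = (rng.standard_normal(n) for n in (4, 5, 6, 7))
x = np.einsum("i,j,k,l->ijkl", a, b, c, d)
ranks = tensor_tree_rank(x)
for node, r in ranks.items():
    print(node, r)
```

For a 4th-order tensor the tree has seven nodes: the root, two interior nodes, and four leaves, matching the structure sketched in Figure 2.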

Spatial-Temporal Infrared Tensor Model
According to the characteristic analysis in [22], the original infrared images can be linearly modeled as:
$$f_D(x, y) = f_B(x, y) + f_T(x, y) + f_N(x, y), \tag{7}$$
where $f_B$, $f_T$, $f_D$, and $f_N$ denote the background image, target image, infrared image, and noise image, respectively. Equation (7) only considers spatial data and ignores the target's motion in the temporal dimension, increasing the risk of missed detections or false alarms in some complex infrared scenes. Moreover, compared with the matrix-based methods, in the tensor domain, we can explore the intrinsic relationships of the data from multiple perspectives and improve computational efficiency.
To ensure the comprehensive utilization of spatial and temporal information, we adopt the approach in [38] to construct the spatial-temporal image tensor. As shown in Figure 1, the input image tensor $\mathcal{D} \in \mathbb{R}^{n_1 \times n_2 \times L}$ is constructed by stacking $L$ consecutive frames in chronological order from the infrared sequence. Therefore, Equation (7) is written as follows:
$$\mathcal{D} = \mathcal{B} + \mathcal{T} + \mathcal{N}, \tag{8}$$
where $\mathcal{B}$, $\mathcal{T}$, $\mathcal{D}$, and $\mathcal{N}$ are the spatial-temporal tensor forms of $f_B$, $f_T$, $f_D$, and $f_N$, respectively. Figure 3 shows that the singular value distribution curves of the image tensor along mode 1, mode 2, and mode 3 rapidly decrease to zero. This indicates that the background tensor $\mathcal{B}$ is a low-rank tensor. Given that the infrared small targets usually occupy only a few pixels in the entire image, it can be assumed that $\mathcal{T}$ is a sparse tensor. At the same time, it is commonly assumed that the noise in infrared images is additive Gaussian noise that satisfies $\|\mathcal{N}\|_F < \delta$. Therefore, the mathematical formula is as follows:
$$\min_{\mathcal{B}, \mathcal{T}} \operatorname{rank}(\mathcal{B}) + \lambda_1 \|\mathcal{T}\|_0 \quad \text{s.t.} \quad \|\mathcal{D} - \mathcal{B} - \mathcal{T}\|_F \le \delta, \tag{9}$$
where $\lambda_1$ is a positive regularization parameter balancing the trade-off between the target spatial-temporal tensor and the background spatial-temporal tensor. Because optimizing the $\ell_0$-norm is NP-hard, in practice it is usually substituted with the $\ell_1$-norm:
$$\min_{\mathcal{B}, \mathcal{T}} \operatorname{rank}(\mathcal{B}) + \lambda_1 \|\mathcal{T}\|_1 \quad \text{s.t.} \quad \|\mathcal{D} - \mathcal{B} - \mathcal{T}\|_F \le \delta. \tag{10}$$
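The tensor construction and the low-rank observation above can be illustrated with a short NumPy sketch. It stacks consecutive frames along a third axis and prints the leading normalized singular values of each mode unfolding. The synthetic sequence (a smooth static background plus a one-pixel moving target) and the helper names `build_tensor` and `mode_singular_values` are assumptions made purely for illustration; they do not reproduce the data behind Figure 3.

```python
import numpy as np

def build_tensor(frames, start, num_frames):
    """Stack `num_frames` consecutive n1 x n2 frames into an n1 x n2 x L tensor."""
    return np.stack(frames[start:start + num_frames], axis=2)

def mode_singular_values(x, mode):
    """Singular values of the mode-`mode` unfolding of x."""
    unfolded = np.moveaxis(x, mode, 0).reshape(x.shape[mode], -1)
    return np.linalg.svd(unfolded, compute_uv=False)

# Synthetic sequence: smooth background plus a dim one-pixel target moving right.
n1, n2, L = 32, 32, 3
yy, xx = np.mgrid[0:n1, 0:n2]
background = 50.0 + 0.5 * xx + 0.3 * yy
frames = []
for t in range(5):
    frame = background.copy()
    frame[10, 5 + t] += 40.0
    frames.append(frame)

D = build_tensor(frames, start=0, num_frames=L)
for mode in range(3):
    s = mode_singular_values(D, mode)
    # Normalized leading singular values; a fast decay suggests low-rankness.
    print("mode", mode + 1, np.round(s[:3] / s[0], 4))
```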

Self-Adaptive Local Prior Information
In infrared images, the strong edges and corner points in the background exhibit sparsity similar to that of the target. This makes it difficult to completely distinguish them from the target when relying solely on global sparse features. Thus, it is essential to extract a local prior and incorporate it into the optimization function to reduce background residuals. For this reason, the structure tensor [48] is used to depict the local geometric structure of infrared images. For an original infrared image $I$, the classic linear structure tensor and its two eigenvalues can be calculated as follows:
$$J_\rho = K_\rho * (\nabla I \otimes \nabla I) = \begin{pmatrix} J_{11} & J_{12} \\ J_{21} & J_{22} \end{pmatrix}, \tag{11}$$
$$\lambda_{1,2} = \frac{1}{2}\left(J_{11} + J_{22} \pm \sqrt{(J_{22} - J_{11})^2 + 4 J_{12}^2}\right), \tag{12}$$
where $K_\rho$ denotes the Gaussian kernel function with variance $\rho$, $*$ denotes the convolution operation, $\nabla$ denotes the gradient, and $\otimes$ denotes the Kronecker product. The difference between $\lambda_1$ and $\lambda_2$ reflects the image area to which the pixel belongs. When the pixel belongs to the flat region, $\lambda_1 \approx \lambda_2 \approx 0$; when the pixel belongs to the corner region, $\lambda_1 \ge \lambda_2 \gg 0$; and when the pixel belongs to the edge region, $\lambda_1 \gg \lambda_2 \approx 0$. The local prior information extracted in RIPT [30] is calculated from these eigenvalues at each pixel position $(x, y)$ (Equation (13)). However, as shown in row 3 of Figure 4, RIPT only captures the edge structure information of the background, which may result in background residuals and the loss of targets. Brown et al. [49] proposed the following corner-strength function to highlight target information:
$$W_{cs}(x, y) = \frac{\det(J_\rho(x, y))}{\operatorname{tr}(J_\rho(x, y))} = \frac{\lambda_1 \lambda_2}{\lambda_1 + \lambda_2}, \tag{14}$$
where $J_\rho(\cdot)$ means the structure tensor, and $\det(\cdot)$ and $\operatorname{tr}(\cdot)$ represent the determinant and trace of the matrix, respectively. The PSTNN model [31] utilizes the maximum eigenvalue as the background weight function, rather than Equation (13), and combines it with Equation (14) to calculate the prior weight (Equation (15)). Row 4 of Figure 4 shows that the PSTNN suppresses the residual edge effect to some extent, but there is still room for improvement. In the MFSTPT model [40], a weighted geometric average strategy was developed to integrate the edge weighting from Equation (13) and the corner-point weighting from Equation (14) (Equation (16)). However, as demonstrated in row 5 of Figure 4, strong edges are still not constrained effectively, despite the improved acquisition of target information. Based on the above analyses, we believe that the previous methods do not fully exploit the pixel information contained in $\lambda_1$ and $\lambda_2$. Another problem is that the weight-stretching parameter artificially set in RIPT and MFSTPT cannot effectively balance the enhancement of the target and the suppression of the background. The underlying reason is the lack of consideration of the clutter information content in infrared images across different scenes. Therefore, a self-adaptive local prior method is proposed to address the above issues. Inspired by the Frangi filter [50], we utilize the ratio and the difference of eigenvalues, with $|\lambda_1| > |\lambda_2|$, to build a statistical measure of edges and corner points at each pixel $(x, y)$ that highlights target information and suppresses background interference (Equation (17)). In edge and corner-point regions, a larger difference between the eigenvalues results in a higher value of this measure. In contrast, in flat regions, the similarity between the two eigenvalues leads to a lower value. In addition, the measure reflects the level of background interference contained in the original image: in scenes with strong clutter, its value is larger, and as the background clutter decreases, its value becomes smaller. This can be used to adaptively suppress edges and corners. Thus, the final self-adaptive prior weight $W_p$ is obtained (Equation (18)), in which the scaling constant is half of the maximum of $\sqrt{\lambda_1^2 + \lambda_2^2}$. The last row in Figure 4 shows that the proposed self-adaptive weight effectively suppresses the residual effect of strong edges and bright corner points, while also highlighting the target information. It can be seen that the adaptive factor enhances suppression in scenes with strong clutter, resulting in a slight target shrinkage, but significantly reduces background residuals compared to other methods. Then, we construct the spatial-temporal tensor $\mathcal{W}_p$ and normalize it as follows:
$$\mathcal{W}_p = \frac{\mathcal{W}_p - w_{\min}}{w_{\max} - w_{\min}}, \tag{19}$$
where $w_{\max}$ and $w_{\min}$ denote the maximum and minimum values of $\mathcal{W}_p$, respectively.
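The eigenvalue behavior described above (flat, edge, and corner regions) can be checked with a small NumPy sketch of the linear structure tensor and the corner-strength ratio of Brown et al. The separable Gaussian smoothing, the central-difference gradient, the value `sigma=1.0`, and the synthetic test image are assumptions chosen for illustration; the sketch does not implement the self-adaptive weight itself.

```python
import numpy as np

def gaussian_smooth(img, sigma):
    """Separable Gaussian smoothing with edge padding (NumPy only)."""
    radius = max(1, int(3 * sigma))
    t = np.arange(-radius, radius + 1)
    k = np.exp(-t**2 / (2.0 * sigma**2))
    k /= k.sum()
    padded = np.pad(img, radius, mode="edge")
    rows = np.apply_along_axis(np.convolve, 1, padded, k, mode="valid")
    return np.apply_along_axis(np.convolve, 0, rows, k, mode="valid")

def structure_tensor_eigs(img, sigma=1.0):
    """Per-pixel eigenvalues (lam1 >= lam2) of the smoothed 2 x 2 structure tensor."""
    gy, gx = np.gradient(img.astype(float))      # derivatives along rows, columns
    j11 = gaussian_smooth(gx * gx, sigma)
    j12 = gaussian_smooth(gx * gy, sigma)
    j22 = gaussian_smooth(gy * gy, sigma)
    half_trace = 0.5 * (j11 + j22)
    root = np.sqrt(0.25 * (j11 - j22) ** 2 + j12 ** 2)
    return half_trace + root, half_trace - root

def corner_strength(lam1, lam2, eps=1e-12):
    """det / trace of the structure tensor, written with its eigenvalues."""
    return lam1 * lam2 / (lam1 + lam2 + eps)

# Synthetic image: a bright square on a dark, flat background.
img = np.zeros((40, 40))
img[10:30, 10:30] = 100.0
lam1, lam2 = structure_tensor_eigs(img)
cs = corner_strength(lam1, lam2)
for name, (r, c) in {"flat": (2, 2), "edge": (20, 10), "corner": (10, 10)}.items():
    print(name, round(float(lam1[r, c]), 2), round(float(lam2[r, c]), 2))
```

On this image the corner-strength map responds at the square's corners but stays near zero along its straight edges, which is the property the text attributes to Equation (14).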
In order to accelerate the convergence speed and improve the computational efficiency, we use the reweighted scheme [51] to add a sparse weight:
$$\mathcal{W}_{sw}^{k+1} = \frac{c}{\left|\mathcal{T}^{k}\right| + \varepsilon}, \tag{20}$$
where $c$ denotes a non-negative constant, $\varepsilon$ represents a small positive number preventing the denominator from being 0, and $k$ is the number of iterations. Considering that the self-adaptive prior weight in Equation (19) can suppress edges and corner points, we obtain $\mathcal{W}_{rec}$ by taking the reciprocal of the corresponding elements in $\mathcal{W}_p$. Combined with the sparse weight in Equation (20), we build the final local prior tensor as follows:
$$\mathcal{W}^{k+1} = \mathcal{W}_{rec} \odot \mathcal{W}_{sw}^{k+1}, \tag{21}$$
where $\odot$ represents the Hadamard product.
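The normalization, reciprocal, and reweighting steps can be sketched as follows. This is a minimal illustration under stated assumptions: the function names, the default values of `c` and `eps`, and in particular the `floor` guard that avoids dividing by zero after min-max normalization are not taken from the paper.

```python
import numpy as np

def normalize(w):
    """Min-max normalization of a prior weight tensor to [0, 1]."""
    span = w.max() - w.min()
    return (w - w.min()) / span if span > 0 else np.zeros_like(w)

def sparse_weight(t_prev, c=1.0, eps=0.01):
    """Reweighted sparse weight c / (|T| + eps) from the previous target estimate."""
    return c / (np.abs(t_prev) + eps)

def local_prior_tensor(w_p, t_prev, c=1.0, eps=0.01, floor=1e-3):
    """Hadamard product of the reciprocal prior weight and the sparse weight."""
    w_rec = 1.0 / np.maximum(normalize(w_p), floor)   # `floor` is an assumed guard
    return w_rec * sparse_weight(t_prev, c, eps)

# Toy example: the prior weight peaks at pixel (0, 1), where the target estimate is nonzero.
w_p = np.array([[0.0, 4.0], [2.0, 1.0]]).reshape(2, 2, 1)
t_prev = np.zeros((2, 2, 1))
t_prev[0, 1, 0] = 1.0
W = local_prior_tensor(w_p, t_prev)
print(W[..., 0])
```

A small entry of the resulting tensor means a small shrinkage threshold for that pixel in the sparse update, so likely target pixels are penalized least.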

Spatial-Temporal Total Variation Regularization
In real-world infrared scenes, heavy noise can be a significant interference, causing false alarms in target detection. Fortunately, the TV model effectively reduces image noise while simultaneously preserving the spatial piecewise smoothness. Introduced by Rudin [52], TV regularization can distinguish between areas with significant variations, such as edges and textures, and smooth areas with large amounts of noise. For the matrix $X \in \mathbb{R}^{n_1 \times n_2}$, the TV norm can be mathematically expressed as follows:
$$\|X\|_{TV} = \sum_{i, j} \left( |x_{i+1, j} - x_{i, j}| + |x_{i, j+1} - x_{i, j}| \right). \tag{22}$$
It can be seen from Equation (22) that the matrix-based TV framework only depicts the spatial continuity of the infrared targets and ignores the temporal continuity between successive frames. To explore the temporal coherence and spatio-temporal smoothness of small targets, the STTV can be obtained:
$$\|\mathcal{X}\|_{STTV} = \|K_h \mathcal{X}\|_1 + \|K_v \mathcal{X}\|_1 + \|K_t \mathcal{X}\|_1, \tag{23}$$
$$K_h \mathcal{X}(i, j, k) = \mathcal{X}(i, j+1, k) - \mathcal{X}(i, j, k), \tag{24}$$
$$K_v \mathcal{X}(i, j, k) = \mathcal{X}(i+1, j, k) - \mathcal{X}(i, j, k), \tag{25}$$
$$K_t \mathcal{X}(i, j, k) = \mathcal{X}(i, j, k+1) - \mathcal{X}(i, j, k), \tag{26}$$
where $K_h$, $K_v$, and $K_t$ represent the horizontal, vertical, and temporal difference operators, respectively. This spatiotemporal form of TV can be seen as an effective regularization term, and it exhibits a degree of resilience against noise while preserving the image details. Furthermore, it not only emphasizes the spatial smoothness of the local region in the image but also considers that the target remains temporally consistent among successive frames.
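As a quick check of how the STTV term behaves, the sketch below evaluates an anisotropic spatial-temporal total variation as the sum of absolute forward differences along the horizontal, vertical, and temporal axes. The anisotropic ($\ell_1$) form and the boundary handling (differences are simply not taken across the border) are assumptions; the paper's operators may treat boundaries differently.

```python
import numpy as np

def sttv(x):
    """Anisotropic STTV of an n1 x n2 x L tensor (sum of absolute forward differences)."""
    dh = np.diff(x, axis=1)   # horizontal differences (along columns)
    dv = np.diff(x, axis=0)   # vertical differences (along rows)
    dt = np.diff(x, axis=2)   # temporal differences (along frames)
    return float(np.abs(dh).sum() + np.abs(dv).sum() + np.abs(dt).sum())

flat = np.full((4, 4, 3), 5.0)
spike = flat.copy()
spike[1, 1, 1] += 2.0         # one isolated noisy voxel

print(sttv(flat))             # a constant tensor has zero variation
print(sttv(spike))            # the spike is penalized along all three axes
```

An isolated spike in the interior contributes two differences of magnitude 2 along each of the three axes, which is why noise raises the STTV value far more than smooth structure does.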

The Proposed TTALP-TV Model
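In tensor robust principal component analysis (TRPCA) problems, the rank function is a nonconvex objective to solve. Therefore, the approximation of the low-rank background tensor $\mathcal{B}$ in Formula (10) is a crucial issue. A recent study [53] shows that employing the tensor tree-based TRPCA method can measure the low-rankness of each mode and reduce memory requirements. In this article, we leverage the advantages of tensor tree rank and present the following optimization model:
$$\min_{\mathcal{B}, \mathcal{T}} \sum_{q=1}^{Q} \alpha_q r_q + \lambda_1 \|\mathcal{T}\|_1 \quad \text{s.t.} \quad \|\mathcal{D} - \mathcal{B} - \mathcal{T}\|_F \le \delta, \tag{27}$$
where $\operatorname{rank}_{\text{tree}}(\mathcal{B}) = (r_1, \cdots, r_Q)^{\mathrm{T}}$ is the tensor tree rank and the weighting vector $\alpha$ meets $\sum_{q=1}^{Q} \alpha_q = 1$. The direct minimization of tensor tree ranks in Formula (27) is NP-hard. As such, we can use their matrix nuclear norms as convex surrogates:
$$\min_{\mathcal{B}, \mathcal{T}} \sum_{q=1}^{Q} \alpha_q \left\|B^{[q]}\right\|_* + \lambda_1 \|\mathcal{T}\|_1 \quad \text{s.t.} \quad \|\mathcal{D} - \mathcal{B} - \mathcal{T}\|_F \le \delta. \tag{28}$$
Furthermore, we incorporate the local prior tensor and STTV regularization to obtain the prior information and suppress background noise, respectively. The proposed TTALP-TV model is as follows:
$$\min_{\mathcal{B}, \mathcal{T}} \sum_{q=1}^{Q} \alpha_q \left\|B^{[q]}\right\|_* + \lambda_1 \|\mathcal{W} \odot \mathcal{T}\|_1 + \lambda_2 \|\mathcal{B}\|_{STTV} \quad \text{s.t.} \quad \|\mathcal{D} - \mathcal{B} - \mathcal{T}\|_F \le \delta, \tag{29}$$
where $\lambda_2$ is a positive regularization parameter.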

Optimization Procedure
The objective function (29) can be solved effectively using the ADMM [54]. By introducing four auxiliary variables, we obtain the equivalent constrained model in Equation (30). Based on the inexact augmented Lagrangian multiplier (IALM) method [55], Equation (30) is rewritten as the augmented Lagrangian function in Equation (31), which involves five Lagrangian multipliers and a positive penalty parameter. Using the ADMM framework, we can divide Equation (31) into the following subproblems:
(a) Updating $\mathcal{B}$ with other variables being fixed: the subproblem in Equation (32) can be rewritten as Equation (33). For each node $C_q \subset D$, the solution of $B^{[q]}$ can be obtained by the singular value thresholding (SVT) [56] in Equation (34), which applies $\operatorname{sth}_\tau(x) = \operatorname{sgn}(x)\max(|x| - \tau, 0)$ to the singular values and recombines them with the singular vectors, where $\mathrm{H}$ denotes the conjugate transpose. According to the tensor tree structure of $\mathcal{B}$, $U_q$ can be used to represent the updated node value instead of directly updating $B^{[q]}$. Moreover, after updating the two successor nodes $C_{q_1}, C_{q_2} \subset C_q$, we can update the transfer tensor $B_q$ to represent $U_q$ for each interior node $C_q \subset D$, where $B_q$ is obtained by applying SVT to the new tensor formed from the updated successor factors $U_{q_1}$ and $U_{q_2}$. In summary, we can utilize tensor tree decomposition to update $\mathcal{B}$, and the solution details are summarized in Algorithm 1.
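The two proximal operators named in this subsection, scalar soft thresholding and singular value thresholding, can be sketched generically in NumPy. This is the textbook form of the operators, not the node-wise tensor tree update of Algorithm 1; the function names and the example threshold are assumptions for illustration.

```python
import numpy as np

def soft_threshold(x, tau):
    """Element-wise shrinkage: sgn(x) * max(|x| - tau, 0)."""
    return np.sign(x) * np.maximum(np.abs(x) - tau, 0.0)

def svt(m, tau):
    """Singular value thresholding: shrink the singular values of m by tau."""
    u, s, vh = np.linalg.svd(m, full_matrices=False)
    return (u * soft_threshold(s, tau)) @ vh

m = np.diag([5.0, 2.0, 0.5])
low_rank = svt(m, tau=1.0)
print(np.round(low_rank, 6))                         # singular values 5, 2, 0.5 become 4, 1, 0
print(soft_threshold(np.array([-3.0, 0.5, 2.0]), 1.0))
```

Applying `svt` to a matricization lowers its rank whenever singular values fall below the threshold, which is how the nuclear norm terms promote a low-rank background; `soft_threshold` plays the same role for sparsity.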

Algorithm 1: The updating of $\mathcal{B}$ from leaves to roots.
end for
$\mathcal{B}$ can be constructed from $U_q$ and $B_q$
Output: $\mathcal{B}$

(b) Updating the next variable with other variables being fixed: the subproblem in Equation (35) has a closed-form solution. Using the element-wise shrinkage approach [57], the corresponding variable is then updated through $\operatorname{TH}(\cdot)$, where $\operatorname{TH}(\cdot)$ denotes the element-wise shrinkage operator.
The complete process of the ADMM optimization method is given in Algorithm 2.

end while
Output: Background component $\mathcal{B}$ and target component $\mathcal{T}$.

Steps of Detection Method
Figure 1 illustrates the whole process of the proposed TTALP-TV model, which is described as follows:
1. Self-adaptive local prior extraction. Given an infrared image, the self-adaptive prior weight $W_p$ is calculated by Equation (18).
2. Spatial-temporal tensor construction. The spatial-temporal infrared tensor $\mathcal{D} \in \mathbb{R}^{n_1 \times n_2 \times L}$ and local prior tensor $\mathcal{W}_p \in \mathbb{R}^{n_1 \times n_2 \times L}$ are constructed by stacking $L$ consecutive frames in chronological order from the original image sequence and the prior weight map, respectively.
3. Background and target separation. The spatial-temporal infrared tensor $\mathcal{D}$ is decomposed into the background tensor $\mathcal{B}$ and the target tensor $\mathcal{T}$ through Algorithm 2.
4. Image reconstruction. Reversing the construction process, the target image $f_T$ is reconstructed from $\mathcal{T}$.

Experimental Results
In this section, we first discuss the datasets used in infrared target detection experiments. Then, we introduce evaluation metrics and analyze the effects of several important parameters on the TTALP-TV model. Finally, we evaluate the detection ability and robustness of the proposed algorithm and compare it with eight state-of-the-art methods in six complex scenes.

Experiment Data
The dataset used in the experiments consists of six infrared image sequences, including complex scenes such as sky, sea, clouds, mountains, and buildings. The infrared sequences 1, 3, 4, and 6 are from [58,59]. In order to carry out an objective assessment of TTALP-TV from diverse scenes, we simulated infrared sequences 2 and 5 using the strategy in [22]. As shown in Figure 5, the images are uniformly scaled to the same size to improve target visibility. Meanwhile, each small target is marked by a red rectangle and magnified in the bottom right corner of the image. It can be seen that, in most scenes, the targets occupy a few pixels and lack shape information and texture features. Due to heavy clutter interference in complex scenes, it is difficult to distinguish the target from the background. The specific descriptions of sequences are presented in Table 2. Additionally, the entire experiment framework was implemented using MATLAB R2020a in Windows 10 based on an AMD Ryzen 7 5800H 3.20 GHz CPU with 16 GB memory.

Evaluation Metrics and Baselines
We evaluate the detection performance of the TTALP-TV method using three evaluation metrics: 3D receiver operating characteristic (3D ROC) [60], signal-to-clutter ratio gain (SCRG), and background suppression factor (BSF). The 3D ROC curve consists of three parameters, including false alarm rate $F_a$, detection probability $P_d$, and threshold $\tau$. The $P_d$ evaluates the target detection capability, while the $F_a$ assesses the background suppression capability, as defined below:
$$P_d = \frac{\text{TD}}{\text{AT}},$$
where TD and AT denote the number of detected targets and the number of actual targets, respectively, and
$$F_a = \frac{\text{FD}}{\text{NP}},$$
where FD and NP denote the number of false detections and the number of image pixels, respectively. Due to the intersections between ROC curves, we calculate the AUC values of three 2D ROC curves, $\text{AUC}_{(P_d, F_a)}$, $\text{AUC}_{(\tau, P_d)}$, and $\text{AUC}_{(\tau, F_a)}$, for a more accurate performance assessment. The values of $\text{AUC}_{(P_d, F_a)}$ and $\text{AUC}_{(\tau, P_d)}$ range from zero to one, where values closer to one indicate better target detection capability. The value of $\text{AUC}_{(\tau, F_a)}$ also lies between zero and one, but here a value closer to zero represents a better ability to suppress background clutter. Therefore, the above three AUC values are combined to comprehensively evaluate the overall accuracy (OA) and the signal-to-noise probability ratio (SNPR), which are defined as follows:
$$\text{AUC}_{\text{OA}} = \text{AUC}_{(P_d, F_a)} + \text{AUC}_{(\tau, P_d)} - \text{AUC}_{(\tau, F_a)},$$
$$\text{AUC}_{\text{SNPR}} = \frac{\text{AUC}_{(\tau, P_d)}}{\text{AUC}_{(\tau, F_a)}},$$
where $\text{AUC}_{\text{OA}} \in [0, 2]$ and $\text{AUC}_{\text{SNPR}} \in [0, +\infty)$. Meanwhile, higher $\text{AUC}_{\text{OA}}$ and $\text{AUC}_{\text{SNPR}}$ denote a stronger ability to detect targets and suppress background clutter, respectively.
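The three 2D AUC values and their combinations can be computed with a short NumPy sketch. Several simplifications here are assumptions made for illustration: detection is counted per pixel rather than per target, the score map is min-max normalized before thresholding, 101 uniformly spaced thresholds are used, and the areas are obtained with a hand-written trapezoidal rule.

```python
import numpy as np

def roc_3d(score, target_mask, n_thresholds=101):
    """Return thresholds tau with P_d(tau) and F_a(tau), counted per pixel (assumed)."""
    span = score.max() - score.min()
    score = (score - score.min()) / span if span > 0 else np.zeros_like(score)
    taus = np.linspace(0.0, 1.0, n_thresholds)
    at, n_pixels = target_mask.sum(), score.size
    pd = np.array([((score >= t) & target_mask).sum() / at for t in taus])
    fa = np.array([((score >= t) & ~target_mask).sum() / n_pixels for t in taus])
    return taus, pd, fa

def trapezoid(y, x):
    """Trapezoidal area under y(x) for x sorted in ascending order."""
    return float(np.sum(0.5 * (y[1:] + y[:-1]) * np.diff(x)))

def auc_summary(taus, pd, fa):
    auc_pd_fa = trapezoid(pd[::-1], fa[::-1])     # F_a decreases with tau, so reverse
    auc_tau_pd = trapezoid(pd, taus)
    auc_tau_fa = trapezoid(fa, taus)
    return {
        "AUC_Pd_Fa": auc_pd_fa,
        "AUC_tau_Pd": auc_tau_pd,
        "AUC_tau_Fa": auc_tau_fa,
        "AUC_OA": auc_pd_fa + auc_tau_pd - auc_tau_fa,
        "AUC_SNPR": auc_tau_pd / (auc_tau_fa + 1e-12),
    }

# Idealized detector output: response only at the single target pixel.
mask = np.zeros((10, 10), dtype=bool)
mask[4, 4] = True
summary = auc_summary(*roc_3d(mask.astype(float), mask))
print(summary)
```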
In addition, the SCRG and BSF can also be used to measure an algorithm's ability to enhance the target and suppress the background, respectively. Both SCRG and BSF are calculated in the neighborhood of the target. As shown in Figure 6, if the target size is $a \times b$, then $(a + 2d) \times (b + 2d)$ denotes the size of the target neighborhood. In the experiments of this paper, we follow [32] to set $d = 65$. The SCRG represents the ratio between the SCR of the detection result and that of the original image, which is expressed as:
$$\text{SCRG} = \frac{\text{SCR}_{\text{out}}}{\text{SCR}_{\text{in}}},$$
where SCR reflects the degree of discrimination between the target and the background clutter in the image, which can be calculated as:
$$\text{SCR} = \frac{|\bar{\mu}_0 - \bar{\mu}_1|}{\sigma_1},$$
where $\bar{\mu}_0$ denotes the target's average gray value, $\bar{\mu}_1$ denotes that of the target neighborhood, and $\sigma_1$ denotes the gray standard deviation of the target neighborhood.
The BSF can evaluate the background suppression ability, which is defined as follows:
$$\text{BSF} = \frac{\sigma_{\text{in}}}{\sigma_{\text{out}}},$$
where $\sigma_{\text{out}}$ and $\sigma_{\text{in}}$ represent the standard deviations of the target neighborhood in the detection result and the original image, respectively.
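A minimal NumPy sketch of the SCR, SCRG, and BSF computations is given below. It assumes that the neighborhood statistics exclude the target pixels themselves, that the neighborhood is clipped at the image border, and that SCRG is a plain ratio rather than a decibel value; the helper names, the `box = (top, left, a, b)` convention, and the synthetic images are hypothetical.

```python
import numpy as np

def target_and_neighborhood(img, top, left, a, b, d):
    """Target mean plus mean and std of the surrounding (a + 2d) x (b + 2d) neighborhood."""
    r0, r1 = max(top - d, 0), min(top + a + d, img.shape[0])
    c0, c1 = max(left - d, 0), min(left + b + d, img.shape[1])
    region = img[r0:r1, c0:c1].astype(float)
    keep = np.ones(region.shape, dtype=bool)
    keep[top - r0:top - r0 + a, left - c0:left - c0 + b] = False   # drop target pixels
    target = img[top:top + a, left:left + b].astype(float)
    return target.mean(), region[keep].mean(), region[keep].std()

def scr(img, box, d):
    mu_t, mu_b, sigma_b = target_and_neighborhood(img, *box, d)
    return abs(mu_t - mu_b) / sigma_b

def scrg_and_bsf(img_in, img_out, box, d):
    sigma_in = target_and_neighborhood(img_in, *box, d)[2]
    sigma_out = target_and_neighborhood(img_out, *box, d)[2]
    return scr(img_out, box, d) / scr(img_in, box, d), sigma_in / sigma_out

# Synthetic check: the "detection result" keeps the target and damps the clutter tenfold.
rng = np.random.default_rng(0)
noise = rng.normal(0.0, 10.0, (64, 64))
img_in = 100.0 + noise
img_in[30:32, 30:32] += 80.0
img_out = 0.1 * noise
img_out[30:32, 30:32] = 80.0
scrg, bsf = scrg_and_bsf(img_in, img_out, box=(30, 30, 2, 2), d=10)
print(round(scrg, 2), round(bsf, 2))
```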

Parameter Analyses
The settings of different parameters in the model have a great impact on the detection performance. Therefore, this section aims to explore the appropriate parameters for the TTALP-TV method in sequences 1-6. According to [61], we set $\lambda_2 = 0.01$. Then, we detail the effects of $L$ and $H$ on the detection capability of our proposed method.

Adjacent Frames Number $L$
In the construction of the spatial-temporal tensor, the adjacent frame number $L$ determines the utilization of temporal domain information. In order to investigate the influence of different $L$ values on the detection performance of the TTALP-TV model, we adjust $L$ from 2 to 6 with a step of 1. Figure 7 shows the analysis results of various $L$ values using the 3D ROC. Increasing $L$ incorporates more temporal information, which helps ensure the low-rankness of the spatial-temporal tensor. At the same time, an overly large adjacent frame number will lead to redundant and repetitive information, resulting in high false alarms. Figure 7 shows that $L = 3$ is the most suitable for the proposed model.

Tuning Parameter $H$
The compromising parameter $\lambda_1$ controls the balance between the sparse target and the low-rank background in the framework. Following [62], we set $\lambda_1 = H / \sqrt{\max(n_1, n_2) \cdot L}$, where $H$ is a crucial tuning parameter. We change $H$ from 4 to 12 with a step of 2. The 3D ROC analysis results of $H$ are shown in Figure 8. It can be seen from Figure 8 that when $H$ increases, the false alarms decrease, which indicates that $H$ assists in the suppression of background residuals. Meanwhile, if $H$ is too large (e.g., $H = 12$), some necessary information may be lost, resulting in the degradation of detection performance. Based on the 3D ROC analysis results shown in Figure 8, we set $H = 10$.

Ablation Study
In order to validate the effectiveness of the self-adaptive local prior and STTV regularization in the proposed TTALP-TV method, we conducted an ablation study, as shown in Figure 9. The TTALP-TV framework consists of three parts: the tensor tree-based spatiotemporal tensor model, the self-adaptive local prior tensor, and STTV regularization. As illustrated in Figure 9, we compare the 3D ROC analysis results of four versions of the TTALP-TV method in sequences 1-6.

Noise Robustness Validation of the Proposed TTALP-TV Method
Due to the influence of the real-world environment on the sensor, infrared images usually contain noise. Therefore, it is essential to evaluate the robustness of the TTALP-TV model to noise. To this end, Gaussian white noise with $\sigma = 5$ and $\sigma = 15$ was added to the six scenes. The second and fourth rows of Figure 10 show the visual detection results for $\sigma = 5$ and $\sigma = 15$, respectively. Figure 10 shows that TTALP-TV can accurately detect targets and suppress noise of different intensities, demonstrating its robustness to noisy scenes.
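The noise injection step can be sketched as follows. This is an illustrative example only: it assumes that the noise level denotes the standard deviation of zero-mean Gaussian noise on an 8-bit gray scale and that noisy values are clipped to [0, 255]. The function name `add_gaussian_noise` and the `seed` parameter are hypothetical.

```python
import random


def add_gaussian_noise(image, sigma, seed=None, lo=0.0, hi=255.0):
    """Add zero-mean Gaussian white noise to every pixel.

    image: list of rows of gray values.
    sigma: noise standard deviation (assumed interpretation of the level).
    seed: optional seed so that an experiment can be repeated.
    Values are clipped to [lo, hi] (assumed 8-bit gray range).
    """
    rng = random.Random(seed)
    return [
        [min(hi, max(lo, pixel + rng.gauss(0.0, sigma))) for pixel in row]
        for row in image
    ]


# Usage: corrupt a flat test frame at the two noise levels used above.
clean = [[128.0] * 64 for _ in range(64)]
noisy_low = add_gaussian_noise(clean, sigma=5, seed=0)
noisy_high = add_gaussian_noise(clean, sigma=15, seed=0)
```

Fixing the seed keeps the noise realization identical across compared methods, which makes the visual results easier to compare.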

Visual Comparison
Figures 11 and 12 show the detection results of the eight compared methods and our method in six infrared sequences. From Figures 11 and 12, we can see that Top-hat leaves a lot of clutter and noise residuals in its detection results. The main reason for this is that the structure size of Top-hat is fixed, meaning it cannot adapt to the dynamics of complex scenes. In contrast, TLLCM suppresses clutter to a certain extent but still has background residuals in complex scenes. Compared with the background consistency and HVS methods, the matrix-based LRSD method IPI contains fewer background residuals, but its background is gray. As can be seen from Figures 11 and 12, the PSTNN and ANLPT methods achieve relatively better target detection performance (e.g., sequences 1, 4, and 6), but they cannot completely suppress the background.
At the same time, we can see that NTFRA can better preserve targets and suppress background interference but fails in complex scenes with highlighted line edges (e.g., sequences 3-4). These single-frame detection methods effectively utilize spatial information to separate the target from the background. However, using only single-frame information results in low robustness to various complex scenes with dynamic changes and heavy clutter. Therefore, many researchers have combined spatial-temporal information to improve detection ability and remove background interference. It can be seen from Figures 11 and 12 that ASTTV-NTLA and NFTDGSTV present exceptional target detection and background suppression abilities in scenes with little clutter interference (e.g., sequence 2). However, when faced with complex scenes with high-brightness clutter and heavy noise (e.g., sequences 3, 4, and 6), their detection performance degrades significantly. In contrast, the proposed TTALP-TV method is not only able to accurately extract the target and preserve a relatively complete shape, but it can also suppress most strong edges and bright corner-point noise in complex scenes.

Quantitative Analysis
In addition to the qualitative analysis in Figures 11 and 12, we adopt five evaluation metrics, namely 3D ROC, $\mathrm{AUC}_{OA}$, $\mathrm{AUC}_{SNPR}$, SCRG, and BSF, to compare the nine methods quantitatively. Figure 13 shows the 3D ROC curves of all comparison methods in complex and noisy scenes (sequences 1-6). In order to clearly depict the differences among the nine methods, a logarithmic scale is used for the false alarm rate axis. As shown in Figure 13, the proposed TTALP-TV method is closer to the top-right corner, indicating that it has superior detection performance. The single-frame detection method ANLPT also achieves good detection performance in sequence 5. Meanwhile, the other sequential-frame detection methods, ASTTV-NTLA and NFTDGSTV, exhibit performance similar to our method in sequences 2 and 6 but perform worse in the remaining sequences. To further assess which method has the best performance, we use $\mathrm{AUC}_{OA}$ and $\mathrm{AUC}_{SNPR}$ to evaluate target detection ability and background suppression ability, respectively. In each sequence, the highest value is highlighted in red, and the second highest value is marked in green. Tables 4 and 5 show that our method achieves the highest $\mathrm{AUC}_{OA}$ and $\mathrm{AUC}_{SNPR}$ values. Tables 6 and 7 list the SCRG and BSF of the nine methods in the six sequences, with the highest and second highest values marked in red and green, respectively. The results show that the ANLPT model achieves the highest SCRG value in sequence 5. On the other hand, the SCRG and BSF values of our model surpass those of the other methods in the more complex scenes (e.g., sequences 1-4 and 6). In summary, the above quantitative analyses demonstrate the effectiveness of our algorithm in both target enhancement and background suppression, particularly in complex scenes.
In addition to the above evaluation metrics, computational efficiency is also a crucial factor in infrared target detection algorithms. Table 8 presents the average running time per frame of all comparison methods on the six sequences. It should be noted that the image size of sequences 1-4 is 256 × 256, and the image sizes of sequences 5 and 6 are 256 × 205 and 296 × 237, respectively. In general, a larger image size results in a longer running time. Table 8 shows that Top-hat has the shortest running time because it adopts a simple model architecture. It is worth noting that the tensor-based algorithms are significantly quicker than the matrix-based IPI algorithm. Among the tensor-based methods, the running time of sequential-frame detection methods (e.g., ASTTV-NTLA, NFTDGSTV) is longer than that of single-frame detection methods (e.g., PSTNN, NTFRA, ANLPT). This is mainly because sequential-frame detection methods require more time to process the temporal-domain information. Table 8 also shows that the proposed method has a longer running time than ASTTV-NTLA and NFTDGSTV, because computing the self-adaptive prior in TTALP-TV adds to the time cost. Based on the qualitative and quantitative results shown in the figures above and in Tables 4-7, it can be concluded that our method has better detection performance than the compared methods. Therefore, the extra running time of our method is acceptable.
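The AUC-based metrics used in the quantitative comparison above can be illustrated with a short sketch. It assumes the definitions commonly used with 3D ROC analysis: AUC_OA = AUC(D,F) + AUC(D,tau) - AUC(F,tau) and AUC_SNPR = AUC(D,tau) / AUC(F,tau), where D is the detection probability, F is the false alarm rate, and tau is the threshold. The function names, the uniform threshold grid, and the flat score/label lists are assumptions for illustration, and detection scores are assumed to be normalized to [0, 1].

```python
def trapezoid(x, y):
    """Area under y(x) by the trapezoidal rule; x must be non-decreasing."""
    return sum(
        (x[i + 1] - x[i]) * (y[i] + y[i + 1]) / 2 for i in range(len(x) - 1)
    )


def roc_3d(scores, labels, num_thresholds=101):
    """Sweep tau over [0, 1] and return (taus, P_D, P_F).

    scores: detection values normalized to [0, 1], one per pixel.
    labels: 1 for target pixels, 0 for background pixels.
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    taus = [i / (num_thresholds - 1) for i in range(num_thresholds)]
    p_d, p_f = [], []
    for tau in taus:
        hits = sum(1 for s, y in zip(scores, labels) if s >= tau and y == 1)
        alarms = sum(1 for s, y in zip(scores, labels) if s >= tau and y == 0)
        p_d.append(hits / n_pos)
        p_f.append(alarms / n_neg)
    return taus, p_d, p_f


def auc_metrics(scores, labels, num_thresholds=101):
    """Return the three 2D AUCs plus AUC_OA and AUC_SNPR (assumed forms)."""
    taus, p_d, p_f = roc_3d(scores, labels, num_thresholds)
    auc_d_tau = trapezoid(taus, p_d)
    auc_f_tau = trapezoid(taus, p_f)
    # The 2D ROC curve (P_D versus P_F) needs its points ordered by P_F.
    points = sorted(zip(p_f, p_d))
    auc_d_f = trapezoid([p[0] for p in points], [p[1] for p in points])
    return {
        "AUC_DF": auc_d_f,
        "AUC_Dtau": auc_d_tau,
        "AUC_Ftau": auc_f_tau,
        "AUC_OA": auc_d_f + auc_d_tau - auc_f_tau,
        "AUC_SNPR": auc_d_tau / auc_f_tau,
    }
```

Under these assumed definitions, a higher AUC_OA reflects better overall detection, while a higher AUC_SNPR reflects stronger background suppression relative to target response.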

Conclusions
In this article, the TTALP-TV model is proposed for infrared small target detection in complex scenes. Because the tensor tree decomposition can exploit the data structure in a more balanced way, we introduce the tensor tree rank to obtain a more accurate background estimation. This decomposition reduces storage costs and retains spatial and temporal correlation through a hierarchical method. In addition, a novel local prior weight is proposed for adaptively assigning weights to targets, which helps to better distinguish targets from similar objects. Meanwhile, STTV is used as a joint regularization term to remove noise while preserving image details. The separation of target and background is thereby converted into an optimization problem. Finally, we provide an efficient ADMM-based framework for solving the proposed TTALP-TV model. Extensive experiments demonstrate that the proposed algorithm not only accurately detects the target but also effectively suppresses background clutter and noise in various complex scenes. However, the real-time performance of our method still needs to be improved due to the prior weight calculation in the model. In the future, our work will focus on establishing more efficient mechanisms to further simplify the calculation and improve detection efficiency.

Figure 1. Flowchart of the proposed TTALP-TV model for infrared small target detection.

Figure 2. The diagram of tensor tree decomposition.

Figure 3. Singular value distribution curves of infrared spatial-temporal tensor along each mode.

Figure 4. Comparison of different local structure priors. Row 1 shows original infrared images. Rows 2 to 6 depict different local prior maps, obtained by Equation (13), RIPT, PSTNN, MFSTPT, and the proposed method, respectively. Columns (a-d) display the prior weights extracted using different calculation methods for four infrared image sequences.

Figure 5. Representative frames corresponding to the six infrared sequences used in the experiments.

Figure 6. Diagram of the target neighborhood.

Figure 7. Three-dimensional ROC curves corresponding to different values of $L$ in the six sequences.

Figure 8. Three-dimensional ROC curves corresponding to different values of $H$ in the six sequences.
Figure 9 shows that leveraging the self-adaptive local prior does in fact improve target detection performance to a certain extent. Moreover, the STTV regularization constraint on the background helps better remove background clutter and noise while preserving image details. The results of the ablation experiments intuitively demonstrate the significance of each module and provide guidance for further attempts to improve the optimization model.

Figure 9. Ablation results of the six sequences in 3D ROC curves.

Figure 10. Detection results of TTALP-TV under different noise intensities.

Figure 11. Detection results of nine methods in sequences 1-3. The red rectangles denote target areas, and the blue ellipses denote noise and background residuals.

Figure 12. Detection results of nine methods in sequences 4-6. The red rectangles denote target areas, and the blue ellipses denote noise and background residuals.

Table 2. Characteristics of the dataset.

Table 3. Detailed parameters of nine methods.

Table 4. $\mathrm{AUC}_{OA}$ and $\mathrm{AUC}_{SNPR}$ of nine methods in sequences 1-3.

Table 5. $\mathrm{AUC}_{OA}$ and $\mathrm{AUC}_{SNPR}$ of nine methods in sequences 4-6.

Table 8. Running time (s) of the nine methods.