Thumbnail Tensor—A Method for Multidimensional Data Streams Clustering with an Efficient Tensor Subspace Model in the Scale-Space †

In this paper an efficient method for signal change detection in multidimensional data streams is proposed. A novel tensor model is suggested for input signal representation and analysis. The model is built from a part of the multidimensional stream by construction of the representing orthogonal tensor subspaces, computed with the higher-order singular value decomposition (HOSVD). Parts of the input data stream from successive time windows are then compared with the model, which is either updated or rebuilt, depending on the result of the proposed statistical inference rule. Because the input signal tensor is processed in the scale-space, a thumbnail-like output is obtained, hence the name thumbnail tensor. The method was experimentally verified on annotated video databases and on real underwater sequences. The results show a significant improvement over other methods, both in accuracy and in speed of operation.


Introduction
Change detection in signal streams aims at finding time stamps which correspond to signal variation as defined by some measures. As a result, the input stream is clustered into chunks that have minimized intra-chunk and maximized inter-chunk scatter, respectively. Moreover, the selected parts of the signal chunks can be used as a summary of the stream or, due to their internal coherence, they can be efficiently compressed. Applications of such clustering are ample in various domains related to signal processing. However, especially for multi-dimensional signals, the task is not an easy one. Signal dimensionality, noise, as well as volume and speed of incoming data mean that the most successful methods are those that are highly adapted to the type of the processed signal. For example, in the case of video stream analysis, the majority of the efficient methods operate with specific image features, such as color, texture, sparse descriptors, etc. [1][2][3]. However, with a growing number of various sensors and measurements there is a need for more general methods that can operate with any type of signal. Such an approach, which relies on signal representation and analysis with tensor algebra, is presented in this paper. The method does not rely on any specific signal features and can operate with a variety of sensor measurements. However, there are not many annotated multidimensional datasets that allow for reliable quantitative measurements. An exception are the available test video datasets [4]. Therefore, in this work we focus on the evaluation of shot detection in video streams.
The proposed method operates as follows. First, a part of the input multidimensional stream is extracted. It is then converted to a tensor representation from which an
1. Hard cuts: an abrupt change of content;
2. Soft cuts: a gradual change of content.
The latter can further be separated into the following two groups:
1. Fade in/out: a new scene gradually appears or disappears from the current image;
2. Dissolving: the current shot fades out whereas the incoming one fades in.
A great majority of the video shot detection methods rely on the extraction of specific features and subsequent data clustering and classification [2,3,14,15]. In this respect, the paper by Asghar et al. [16] contains a survey of video indexing. The methods of temporal video segmentation and keyframe detection are also discussed therein. A method for construction of a video abstraction is described in the work by Truong and Venkatesh [17]. An overview of the multi-view video summarization algorithms is presented in the paper by Fu et al. [18]. Efficient video summarization and retrieval tools are also discussed by Valdes and Martinez [19].
An interesting survey on video scene detection can be found in the work by Fabro and Böszörmenyi [20]. They classify the video segmentation methods into seven groups, depending on the low-level features used for the segmentation. The cited groups are as follows: the visual-based, audio-based, text-based, audio-visual, visual-textual, audio-textual and hybrid segmentations.
A unified scheme of shot boundary detection and anchor shot detection in news video story parsing is presented in the paper by Lee et al. [21]. Their method is based on the singular value decomposition and the Kernel-ART method. DeMenthon et al., in turn, presented a video summarization method based on curve simplification [22]. In this approach, a video sequence is represented as a trajectory curve embedded in a high dimensional feature space. It is then analyzed with the binary curve-splitting algorithm. Videos partitioned this way are represented with tree data structures. Mundur et al. [23] proposed another approach to video summarization, in which keyframe-based video summarization is computed with Delaunay clustering [24]. The STIMO method of storyboard creation from moving videos for the web scenario is proposed in the system by Furini et al. [25]. Their algorithm is based on a fast clustering that selects the most representative video content using the color distribution in the Hue-Saturation-Value (HSV) color space, computed on a frame-by-frame basis.
De Avila et al. suggested the already mentioned VSUMM method [11]. Their approach is based on the computation of color histograms from the video frames. These are then clustered with the k-means method. For each cluster, a frame closest to the cluster center is then chosen. This is the keyframe that represents a given slice of a video. De Avila et al. also proposed a method of video static summaries evaluation which is used for method comparison. Also in this paper we follow their proposed evaluation strategy with the help of the user annotations available from the Internet [12].
Color histograms for video summarization have been suggested in the method by Cayllahua-Cahuina et al. [26]. In this approach, 3D histograms of 16 × 16 × 16 bins are calculated directly from the RGB image representation. As a result, 4096-dimensional vectors are obtained. These are further compressed with the PCA method. In the next step, two clustering algorithms are launched. Fuzzy-ART is used for the determination of the number of clusters. After this, Fuzzy C-Means performs frame clustering from the color histogram features. However, using only color information is not enough to obtain satisfactory results, as shown in [26].
In the papers by Medentzidou and Kotropoulos, video summarization methods, based on shot boundary detection with penalized contrast, are put forth [27]. These approaches also rely on color analysis, however, this time in the HSV color space. The mean of the hue component is used as a main indicator of a change in the video. Then, video segments are extracted and represented with a linear model. As reported, the method obtains results comparable to the VSUMM method by de Avila et al. [11]. The next video summarization method, called VSCAN, was proposed by Mahmoud et al. [28]. In this approach, a modified density-based spatial DBSCAN-like clustering method is used. However, once again, video summarization is based entirely on color and texture feature processing.
The concepts of the data stream analysis, concept drift detection, as well as data classification in data streams are also discussed in this paper. In this respect, the book by Gama [1] or the paper by Krawczyk et al. [3], can be recommended as further introductions to this subject.
On the other hand, a short introduction to the tensor algebra and tensor decomposition methods, used in the proposed methods, is contained in Section 4.

A Framework for Multidimensional Data Stream Clustering
An architecture of the multi-dimensional data stream processing in the proposed tensor framework is depicted in Figure 1. We assume that the input stream may consist of a potentially infinite series of L-dimensional signals, which can be represented as tensors. From these, at a given time stamp, a window of a fixed size W is selected from which a model tensor is computed, as will be discussed. Then, each incoming data tensor is checked to fit to this model. If it does, then the model is updated, as will be discussed. Otherwise, the model is rebuilt starting at the current data position, and the whole process is repeated. As alluded to previously, a similar scheme based on the best rank-R tensor decomposition has been proposed in our work [13]. However, this model is computationally demanding: the algorithm is iterative and requires tensor decomposition in all its dimensions. Therefore, although our previous method produced good results, its operation time allows for processing of only up to three color frames per second. In this paper we suggest a simpler tensor model, which is based on OTS computation with HOSVD of the reduced input tensors. By contrast with the best rank-R algorithm, OTS requires only one solution to the eigenvalue problem. Because of this and as a result of the proposed representation, it is computed from a symmetric matrix of a small size W × W, as will be discussed. Furthermore, the matrix is smaller due to reduction of the input data. All these, as well as application of the fast eigenvalue computation with an automatic selection of the leading eigenvectors, result in much better accuracy. Also, an order of magnitude speed-up was achieved as compared to other tensor-based methods. In the next sections, details of the computational steps in Figure 1 are presented.

Construction of the Orthogonal Tensor Subspace (OTS)-Based Model
Tensors in processing of multi-dimensional data offer many advantages compared to the vector-based methods. Most importantly, in the tensor domain the neighbor relations among elements are retained, whereas a separate dimension of a tensor represents each degree of freedom. Another advantage is that tensor methods can work with any type of signal since no specific features are assumed. In the next sections, we present a short introduction to signal representation and processing with tensor-based methods. Details of our suggested tensor model and its updating scheme are also presented. Further information on tensor processing can be found in the literature [6,29,30,31,32].

Higher-Order Singular Value Decomposition (HOSVD) for Data Stream Analysis
Since the experimental results presented in this paper are related to tensors composed from the 3D color video frames, with no loss of generality, a further analysis in this section is constrained to the 4D tensors.
Tensor algebra found its place in 20th-century physics as a convenient tool for the formulation of general relativity. In this respect, the distinctive properties of tensors are their transformation rules, which precisely describe the change of tensor components with respect to a change of the coordinate system [6]. However, since then other definitions of tensors have been formulated: as multi-linear maps and as multidimensional arrays of real numbers, respectively. With these new interpretations, tensors found broader applications in psychometrics, chemometrics, data science, geophysics, mechanics analysis, as well as in computer vision, pattern recognition and graphics, just to name a few [6,29,31,32,33]. We also follow this newer interpretation of tensors. Therefore, it is assumed that a 4D tensor can be represented as a four-dimensional array of real values, that is (tensors are written with calligraphic letters, while for matrices and vectors the bold font is used):

$$\mathcal{T} \in \mathbb{R}^{N_1 \times N_2 \times N_3 \times N_4}, \tag{1}$$

where $N_j$ stands for the j-th dimension, for 1 ≤ j ≤ 4. On the other hand, with no loss of information, each tensor can be unambiguously represented in a matrix form. Such a representation, known as tensor flattening, will be extensively used in the algorithms presented in the further part of this paper. In this approach, a flattening in the j-th dimension is defined as the following matrix:

$$\mathbf{T}_{(j)} \in \mathbb{R}^{N_j \times (N_1 \cdots N_{j-1} N_{j+1} \cdots N_4)}. \tag{2}$$

In other words, tensor flattening is obtained from a tensor $\mathcal{T}$ by selecting its j-th dimension to become the row dimension of the matrix $\mathbf{T}_{(j)}$. On the other hand, the product of all other dimensions constitutes the column dimension of the matrix $\mathbf{T}_{(j)}$.
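To make the flattening in the j-th dimension more concrete, a minimal Python sketch is given here. It is only an illustration: the helper names (`shape_of`, `element`, `flatten_mode`), the nested-list storage, and the column ordering (remaining indices in their natural order, with the last one changing fastest) are assumptions made for this example and are not taken from the described implementation.

```python
from itertools import product

def shape_of(t):
    """Return the dimensions of a regular nested-list tensor."""
    dims = []
    while isinstance(t, list):
        dims.append(len(t))
        t = t[0]
    return dims

def element(t, idx):
    """Read a single element addressed by a sequence of indices."""
    for i in idx:
        t = t[i]
    return t

def flatten_mode(t, j):
    """Mode-j flattening: index j becomes the row index, all remaining
    indices (natural order, last one fastest) form the column index."""
    dims = shape_of(t)
    others = [d for k, d in enumerate(dims) if k != j]
    rows = []
    for r in range(dims[j]):
        row = []
        for rest in product(*(range(d) for d in others)):
            idx = list(rest)
            idx.insert(j, r)  # put the row index back at position j
            row.append(element(t, idx))
        rows.append(row)
    return rows

# A tiny 2 x 2 x 2 x 3 tensor whose entries encode their own indices.
T = [[[[1000 * a + 100 * b + 10 * c + d for d in range(3)]
       for c in range(2)] for b in range(2)] for a in range(2)]
T4 = flatten_mode(T, 3)  # 3 rows (last dimension), 2 * 2 * 2 = 8 columns
```

With the last dimension chosen as the row dimension, each row of `T4` collects all elements that share the same last index, which is the arrangement used later when every frame of a window becomes one row.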
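In further derivations the k-th modal product of a tensor $\mathcal{T} \in \mathbb{R}^{N_1 \times \ldots \times N_4}$ and a matrix $\mathbf{M} \in \mathbb{R}^{Q \times N_k}$ is employed. The result of this product is a tensor $\mathcal{S} \in \mathbb{R}^{N_1 \times \ldots N_{k-1} \times Q \times N_{k+1} \times \ldots N_4}$, defined as follows:

$$\mathcal{S}_{n_1 \ldots n_{k-1}\, q\, n_{k+1} \ldots n_4} = (\mathcal{T} \times_k \mathbf{M})_{n_1 \ldots n_{k-1}\, q\, n_{k+1} \ldots n_4} = \sum_{n_k=1}^{N_k} t_{n_1 \ldots n_k \ldots n_4}\, m_{q n_k}. \tag{3}$$

However, what is really interesting from the analytical point of view are tensor decompositions. In this paper the HOSVD will be used. Namely, considering the tensor properties (1)-(3), the HOSVD of a tensor $\mathcal{T} \in \mathbb{R}^{N_1 \times N_2 \times N_3 \times N_4}$ is defined as follows [6,34,35,36,37]:

$$\mathcal{T} = \mathcal{Z} \times_1 \mathbf{S}_1 \times_2 \mathbf{S}_2 \times_3 \mathbf{S}_3 \times_4 \mathbf{S}_4, \tag{4}$$

where $\mathbf{S}_k$ stands for a unitary mode matrix of dimensions $N_k \times N_k$, and $\mathcal{Z} \in \mathbb{R}^{N_1 \times N_2 \times N_3 \times N_4}$ is a core tensor of the same dimensions as $\mathcal{T}$. Furthermore, it can be shown that the core tensor $\mathcal{Z}$ fulfills the following properties:

1. Two sub-tensors $\mathcal{Z}_{n_k=a}$ and $\mathcal{Z}_{n_k=b}$, obtained by fixing the $n_k$ index to a or b, are orthogonal, that is, for all possible values of k for which a ≠ b the following holds:

$$\mathcal{Z}_{n_k=a} \cdot \mathcal{Z}_{n_k=b} = 0. \tag{5}$$

2. All sub-tensors can be ordered according to their Frobenius norms:

$$\|\mathcal{Z}_{n_k=1}\| \ge \|\mathcal{Z}_{n_k=2}\| \ge \ldots \ge \|\mathcal{Z}_{n_k=N_k}\| \ge 0. \tag{6}$$

In the framework put forth in [13], and also in this paper, the input tensor $\mathcal{T}$ is composed of a series of 3D frame-tensors $\mathcal{F}_w$, for 1 ≤ w ≤ W. That is, the input tensor is constructed as follows:

$$\mathcal{T} = [\mathcal{F}_1 \,|\, \mathcal{F}_2 \,|\, \ldots \,|\, \mathcal{F}_W]. \tag{7}$$

However, in this work the size of the input tensor is reduced by two methods: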

1. Randomization, by means of a random selection of rows and columns. This is based on the Mersenne Twister uniform random generator and yields a tensor of the given lower dimensions. As shown in recent works by Halko et al. [5], as well as by Zhou et al. [38], such randomization simplifies tensor processing and, even more importantly, allows for the discovery of the low-rank structure in huge tensors.
2. The scale-space approach, in which the input signal is low-pass filtered and down-sampled to the given lower dimensions. Such a strategy has been applied in the famous SIFT detector [39] and for object detection in, e.g., our previous works [14,40].
In effect, the input tensor is replaced by its reduced version, as follows:

$$\mathcal{T} = [\tilde{\mathcal{F}}_1 \,|\, \tilde{\mathcal{F}}_2 \,|\, \ldots \,|\, \tilde{\mathcal{F}}_W], \tag{8}$$

where $\tilde{\mathcal{F}}_i$ denotes the version of the original tensor $\mathcal{F}_i$ reduced with either of the two methods. This process is depicted in Figure 2a. It should be noted, however, that in experiments on video signals the second method, that is, scale-space decimation, provided better results by 1-4% compared to the tensor randomization. This can be caused by the natural characteristics of visual signals, in which neighboring pixels are highly correlated. Therefore, in further considerations we refer to the second method of tensor reduction. However, the randomized version can be beneficial in cases where the correlation of neighboring elements is not so strong [5,38]. Moreover, in our experiments the same decimation coefficient D, in the range 0.2-1.0, was used to uniformly reduce the tensor size in all dimensions.
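The two reduction strategies can be illustrated with a short Python sketch operating on a single 2D frame. It rests on several assumptions that are not stated in the text: a plain box filter stands in for the low-pass filter, the integer `step` plays the role of the inverse of the decimation coefficient D, and the function names are hypothetical. The only library fact relied upon is that Python's `random` module is built on the Mersenne Twister generator.

```python
import random

def decimate(frame, step):
    """Scale-space style reduction of a 2D frame: average each
    step x step block (a simple box low-pass filter) and keep one
    value per block (down-sampling)."""
    h, w = len(frame), len(frame[0])
    out = []
    for r in range(0, h - step + 1, step):
        row = []
        for c in range(0, w - step + 1, step):
            block = [frame[r + i][c + k]
                     for i in range(step) for k in range(step)]
            row.append(sum(block) / len(block))
        out.append(row)
    return out

def random_subsample(frame, n_rows, n_cols, seed=0):
    """Randomized reduction: keep a random subset of rows and columns.
    The same seed has to be used for every frame of a window so that
    all frames are reduced in the same way."""
    rng = random.Random(seed)  # Mersenne Twister based generator
    rows = sorted(rng.sample(range(len(frame)), n_rows))
    cols = sorted(rng.sample(range(len(frame[0])), n_cols))
    return [[frame[r][c] for c in cols] for r in rows]

frame = [[float(4 * r + c) for c in range(4)] for r in range(4)]
small = decimate(frame, 2)              # 4 x 4 -> 2 x 2
picked = random_subsample(frame, 2, 2)  # 4 x 4 -> 2 x 2
```

In the decimated frame every output value still depends on all pixels of its block, whereas the randomized variant simply discards the unselected rows and columns. This difference is one plausible reading of why decimation behaves better on strongly correlated visual data.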
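Subsequent construction of the tensor $\mathcal{T}$ is depicted in Figure 2b. $\mathcal{T}$ can now be decomposed with the HOSVD to build an OTS. The OTS serves as a model of the reduced tensor window W. In this respect we follow the idea of Savas and Eldén, originally presented in [41], then followed in [7]. It is easy to notice that a simple re-arrangement of (4) yields:

$$\mathcal{T} = (\mathcal{Z} \times_1 \mathbf{S}_1 \times_2 \mathbf{S}_2 \times_3 \mathbf{S}_3) \times_4 \mathbf{S}_4 = \mathcal{B} \times_4 \mathbf{S}_4, \tag{9}$$

where, owing to the condition (5), the 3D sub-tensors $\mathcal{B}_w$ of $\mathcal{B}$, obtained by fixing its fourth index to w, for 1 ≤ w ≤ W, are orthogonal. In other words, their product gives 0. Equation (9) can be further written as follows:

$$\mathcal{T} = \sum_{w=1}^{W} \mathcal{B}_w \times_4 \mathbf{s}_4^w, \tag{10}$$

where the vectors $\mathbf{s}_4^w$ denote columns of the unitary matrix $\mathbf{S}_4$. Because each tensor $\mathcal{B}_w$ is three-dimensional, $\times_4$ denotes the outer product of each 3D tensor and a vector $\mathbf{s}_4^w$, as defined in (3).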
Figure 2. The input stream consists of a potentially infinite series of D-dimensional tensors. This is reduced in all (or selected) dimensions either by the uniform randomization or by the low-pass filtering to the coarser scale-space level (a). From these, a window of a fixed size W is selected from which a model tensor is constructed. The orthogonal tensor subspace (OTS) model is computed from one tensor flattening alongside its last dimension. It is easy to observe that each data item from the series constitutes one row in this flattening. The order of flattening is irrelevant if kept consistent among all tensors (b).

This way, from the tensor $\mathcal{T}$ the OTS is constructed, as visualized in Figure 3. In other words, the OTS constitutes a model. The reason for constructing the OTS is its ability to represent the series of W frames, as well as to introduce a distance of an incoming data item (tensor) to that space. This property will be used to check a fitness measure of each frame to the model, as will be discussed.
Tensors used in the above formula need to be normalized. The value of (11) will be used to assess model consistency. That is, values of R are computed for all frames belonging to the model. Then R is computed for each new frame to assess its consistency with the model, as will be discussed.
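As a hedged illustration of how such a fit value can be obtained, the sketch below assumes that R is the sum of squared inner products between the normalized test tensor and the normalized base tensors, which is the form commonly used with the approach of Savas and Eldén [41]. This form, the function names, and the use of flattened tensors stored as plain lists are assumptions of the example.

```python
import math

def inner(a, b):
    """Inner product of two flattened tensors."""
    return sum(x * y for x, y in zip(a, b))

def normalize(v):
    """Scale a flattened tensor to unit Frobenius norm."""
    n = math.sqrt(inner(v, v))
    return [x / n for x in v]

def fit_measure(bases, frame):
    """Assumed form: R = sum_w <B_w, F>^2 with unit-norm B_w and F.
    For mutually orthogonal bases R lies in [0, 1]; R = 1 means that
    the frame lies entirely in the spanned subspace."""
    f = normalize(frame)
    return sum(inner(normalize(b), f) ** 2 for b in bases)

# Two orthogonal (flattened) base tensors spanning the x-y plane.
bases = [[1.0, 1.0, 0.0], [1.0, -1.0, 0.0]]
r_in = fit_measure(bases, [3.0, 4.0, 0.0])   # inside the subspace
r_out = fit_measure(bases, [0.0, 0.0, 2.0])  # orthogonal to it
```

Under this assumption, a frame that is well represented by the model yields a value close to one, while a frame from a different shot yields a visibly lower value.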

Efficient Computation of the Orthogonal Tensor Subspaces
Since efficient computation of the base tensors $\mathcal{B}_w$ from the input signal is essential for the operation of the method, in this section we propose an effective algorithm. Practically, $\mathcal{B}_w$ can be simply computed after rearrangement of Equation (7), as follows [7,42]:

$$\mathcal{B}_w = \mathcal{T} \times_4 (\mathbf{s}_4^w)^T. \tag{12}$$

Thus, to compute $\mathcal{B}_w$ it is sufficient to compute only the mode matrix $\mathbf{S}_4$. It can be computed from the SVD of the flattened matrix $\mathbf{T}_{(4)}$, that is:

$$\mathbf{T}_{(4)} = \mathbf{S}_4 \mathbf{V}_4 \mathbf{D}_4^T. \tag{13}$$

However, $\mathbf{T}_{(4)}$ is large, with the number of rows equal to W and the number of columns being the product of its dimensions 1-3. For example, for a color video this is the total number of pixels in the input frames times three color channels. To overcome this problem, both sides of (13) can be multiplied by $\mathbf{T}_{(4)}^T$, to obtain the following:

$$\mathbf{T}_{(4)} \mathbf{T}_{(4)}^T = \mathbf{S}_4 \mathbf{V}_4^2 \mathbf{S}_4^T. \tag{14}$$

The above product $\mathbf{T}_{(4)} \mathbf{T}_{(4)}^T$ has dimensions of only W × W. Moreover, it is a symmetric matrix. Owing to these properties, an effective fixed-point eigenvalue decomposition algorithm can be used, as described in the next section. What is also important is that the OTS model is computed only once and from a smaller matrix, due to tensor processing in the scale-space (or randomization), as well as to the condition (14). Because of this, computation of the base tensors can be much faster compared with other tensor decomposition schemes. The above steps of the model building procedure are shown in Algorithm 1.

Input:
A finite partition of the multi-dimensional data stream from a window W;

Output:
An orthogonal tensor subspace (OTS) represented with the base tensors $\mathcal{B}_w$;

1. Fill the buffer with W input data and construct the scale-space/randomized tensor $\mathcal{T}$ in (8);
2. Construct the flattened matrix $\mathbf{T}_{(4)}$ of the tensor $\mathcal{T}$;
From (12) compute the bases $\mathcal{B}_w$;
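The model building procedure can be sketched in Python as follows. The sketch assumes that the frames are already reduced and flattened into rows of $\mathbf{T}_{(4)}$. For self-containment it uses a classical cyclic Jacobi eigen-solver rather than the fast fixed-point algorithm proposed in the paper, and it neither sorts nor truncates the eigenvectors. All function names are hypothetical.

```python
import math

def gram(rows):
    """Product matrix P = T(4) T(4)^T; symmetric, of size W x W."""
    return [[sum(x * y for x, y in zip(ri, rj)) for rj in rows]
            for ri in rows]

def jacobi_eigen(p, sweeps=30):
    """Cyclic Jacobi rotations for a real symmetric matrix.
    Returns (eigenvalues, v); column k of v is the k-th eigenvector.
    Eigenvalues are not sorted."""
    n = len(p)
    a = [row[:] for row in p]
    v = [[float(i == j) for j in range(n)] for i in range(n)]
    for _ in range(sweeps):
        for i in range(n - 1):
            for j in range(i + 1, n):
                if a[i][j] == 0.0:
                    continue
                d = a[j][j] - a[i][i]
                th = math.pi / 4 if d == 0.0 else 0.5 * math.atan(2.0 * a[i][j] / d)
                c, s = math.cos(th), math.sin(th)
                for m in (a, v):      # column rotation: m <- m J
                    for k in range(n):
                        mi, mj = m[k][i], m[k][j]
                        m[k][i], m[k][j] = c * mi - s * mj, s * mi + c * mj
                for k in range(n):    # row rotation: a <- J^T a
                    ai, aj = a[i][k], a[j][k]
                    a[i][k], a[j][k] = c * ai - s * aj, s * ai + c * aj
    return [a[k][k] for k in range(n)], v

def build_model(rows):
    """rows: W flattened, reduced frames, i.e., the rows of T(4)."""
    lam, s4 = jacobi_eigen(gram(rows))
    w, n = len(rows), len(rows[0])
    # Base b is row b of S4^T T(4): a combination of the frames
    # weighted by the b-th eigenvector (column b of S4).
    bases = [[sum(s4[i][b] * rows[i][k] for i in range(w)) for k in range(n)]
             for b in range(w)]
    return lam, bases

frames = [[1.0, 2.0, 0.0, 1.0], [2.0, 1.0, 1.0, 0.0], [0.0, 1.0, 3.0, 1.0]]
lam, bases = build_model(frames)
```

The eigen-decomposition is carried out on a W × W matrix regardless of the frame size, and the resulting bases are mutually orthogonal, with squared norms equal to the corresponding eigenvalues.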

Model Fitness Measure and Efficient Model Updating Scheme
As already pointed out, the measure R in (11) can be used to tell the distance of a tensor $\tilde{\mathcal{F}}$ to the OTS model represented by the basis $\{\mathcal{B}_w\}$. Values of R for the model frames, as well as for all other frames from the stream, can be used for the statistical analysis of abrupt signal changes in the stream. However, instead of the absolute values of R, better results are obtained when the differences ∆R are used for computations. That is, the following error function is defined:

$$\Delta R_i \equiv R_i - R_{i-1}. \tag{15}$$

For proper detection of the shots with slowly changing content, the following drift measure is proposed [13]:

$$|\Delta R_i - \bar{R}_\Delta| < a\,\sigma_\Delta + b, \tag{16}$$

where a is a multiplicative factor (3.0-4.0) and b is an additive component (0.2-2.5). The parameters $\bar{R}_\Delta$ and $\sigma_\Delta$ denote the mean and standard deviation computed from the differences of fit values in (15) for the model frames from (7), as follows:

$$\bar{R}_\Delta = \frac{1}{W}\sum_{w=1}^{W}\Delta R_w, \qquad \sigma_\Delta^2 = \frac{1}{W-1}\sum_{w=1}^{W}\left(\Delta R_w - \bar{R}_\Delta\right)^2. \tag{17}$$

Each new tensor $\tilde{\mathcal{F}}$ is checked to fit to the model in accordance with (16) and (17). If it does not fit, the model has to be rebuilt from a new set of frames, starting at the position of $\tilde{\mathcal{F}}$. However, to achieve robustness against spurious signals, the fitness algorithm requires that a number G of consecutive frames divert from the model in order to start the model rebuild process.
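The decision logic described above can be sketched as follows. The sketch makes several assumptions: the fit test takes the form |∆R − mean| < a·σ + b, the statistics are computed from the W−1 successive differences available inside the model window, and ∆R of a new frame is measured against the last frame that fitted the model. The function names and the example values of R are hypothetical.

```python
import math

def drift_stats(model_r):
    """Mean and sample standard deviation of the successive
    differences of the fit values R of the model frames."""
    diffs = [b - a for a, b in zip(model_r, model_r[1:])]
    mean = sum(diffs) / len(diffs)
    var = sum((d - mean) ** 2 for d in diffs) / (len(diffs) - 1)
    return mean, math.sqrt(var)

def find_rebuild_position(model_r, stream_r, a=3.5, b=0.2, g=2):
    """Return the index in stream_r at which g consecutive frames
    start to divert from the model, or None if this never happens."""
    mean, sigma = drift_stats(model_r)
    last_fit = model_r[-1]  # R of the last frame that fitted the model
    misses = 0
    for pos, r in enumerate(stream_r):
        if abs((r - last_fit) - mean) < a * sigma + b:
            last_fit, misses = r, 0  # frame fits: model would be updated
        else:
            misses += 1              # frame diverts from the model
            if misses == g:
                return pos - g + 1
    return None

model_r = [100.0, 101.0, 100.5, 101.5, 101.0]
stream_r = [101.2, 90.0, 101.1, 80.0, 79.0, 78.5]
start = find_rebuild_position(model_r, stream_r)
```

In this example the isolated outlier at position 1 is ignored, whereas the two consecutive misfits starting at position 3 trigger a rebuild, which illustrates the role of the parameter G.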
On the other hand, if ... F fits to the model, the model needs only to be updated. Figure 4 depicts the proposed efficient method of updating of the flattened version T (4) of the model tensor. This is done by simple insertion of only the new row and obliterating the oldest row in the T (4) matrix. Consequently, in the product matrix T (4) T T (4) all values except one row and one column can be reused, as shown in Figure 5. The following steps describe the model update algorithm (Algorithm 2).
Shift data in $\mathbf{T}_{(4)}$ by one row up (Figure 4) and fill the last row with the flattened version of $\tilde{\mathcal{F}}$.

In the above model updating algorithm the most time-consuming is step 3, which involves W−1 products of the tensor frames. On the other hand, the last step 4 is relatively fast and consumes the same amount of time as in the full model build step, since it requires a solution of the eigenvalue problem of a matrix of size W × W.
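The reuse of the product matrix during a model update can be illustrated with the following Python sketch. It assumes that the flattened frames are kept as a list of rows, with the oldest frame first, and the function names are hypothetical. The result is checked against a full recomputation.

```python
def dot(a, b):
    """Inner product of two flattened frames."""
    return sum(x * y for x, y in zip(a, b))

def full_gram(rows):
    """Full computation of T(4) T(4)^T, used here only as a reference."""
    return [[dot(ri, rj) for rj in rows] for ri in rows]

def update_model(rows, p, new_row):
    """Slide the window by one frame.
    rows: W flattened frames (rows of T(4)), oldest first;
    p: the current W x W product matrix."""
    rows = rows[1:] + [new_row]  # drop the oldest row, append the newest
    w = len(rows)
    # Reuse the (W-1) x (W-1) block that does not involve the dropped frame.
    q = [[p[i + 1][j + 1] for j in range(w - 1)] + [0.0]
         for i in range(w - 1)]
    q.append([0.0] * w)
    # Only the last row and the last column need new inner products.
    for i in range(w - 1):
        q[i][w - 1] = q[w - 1][i] = dot(rows[i], new_row)
    q[w - 1][w - 1] = dot(new_row, new_row)
    return rows, q

rows = [[1.0, 0.0, 2.0], [0.0, 1.0, 1.0], [3.0, 1.0, 0.0]]
p = full_gram(rows)
rows2, p2 = update_model(rows, p, [1.0, 1.0, 1.0])
```

Per update, only W−1 inner products with the retained frames and one squared norm of the new frame are computed, instead of all W × W entries.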

Efficient Computation of the Leading Eigenvectors
The computation steps in formula (14) involve only the positive real symmetric matrices. Therefore, it is possible to employ a faster algorithm than the general SVD decomposition. For this purpose, the so called fixed-point eigen-decomposition algorithm is proposed [42,43]. Algorithm 3 shows the key steps of this method.

Input:
A real symmetric matrix P; a number K of expected eigenvectors, 1 ≤ K ≤ rows(P); a maximal number of iterations $i_{max}$; an orthogonality threshold ε;

Output:
K leading eigenvectors of P;

while err > ε and i < $i_{max}$
7. e (i)
12. Set i ← i + 1

As shown in our previous work, application of this fast algorithm allows for up to a five-fold speed-up. Further properties of this algorithm are discussed in [13,42,44].
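A minimal Python sketch of a fixed-point eigen-decomposition with the same inputs and outputs is given below. It follows the general scheme of repeated multiplication by P, Gram-Schmidt orthogonalization against the already found eigenvectors, and normalization. The start vector, the exact form of the error term, and the function names are assumptions of this sketch and may differ from the steps of Algorithm 3.

```python
import math

def mat_vec(p, x):
    """Matrix-vector product p x."""
    return [sum(a * b for a, b in zip(row, x)) for row in p]

def dot(x, y):
    """Inner product of two vectors."""
    return sum(a * b for a, b in zip(x, y))

def fixed_point_eigen(p, k, i_max=1000, eps=1e-12):
    """Return (eigenvalues, eigenvectors) for the k leading eigenpairs
    of a symmetric positive semi-definite matrix p."""
    n = len(p)
    vecs, vals = [], []
    for _ in range(k):
        e = [1.0 / math.sqrt(n)] * n  # arbitrary unit start vector
        err, i = 1.0, 0
        while err > eps and i < i_max:
            new = mat_vec(p, e)
            for u in vecs:            # Gram-Schmidt against found vectors
                proj = dot(new, u)
                new = [a - proj * b for a, b in zip(new, u)]
            norm = math.sqrt(dot(new, new))
            if norm == 0.0:           # nothing left in this direction
                break
            new = [a / norm for a in new]
            # Stop when successive iterates are (anti)parallel.
            err = abs(abs(dot(new, e)) - 1.0)
            e = new
            i += 1
        vecs.append(e)
        vals.append(dot(e, mat_vec(p, e)))  # Rayleigh quotient
    return vals, vecs

P = [[4.0, 1.0, 0.0], [1.0, 3.0, 1.0], [0.0, 1.0, 2.0]]
vals, vecs = fixed_point_eigen(P, 3)
```

Since the eigenvectors are produced one at a time in the order of decreasing eigenvalues, the loop over k can be stopped as soon as enough significant eigenvectors have been found.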

Computation of the Leading Eigenvectors
The main idea behind finding the number of significant eigenvalues in decomposition (14) is based on the detection of a significant drop in the ordered series of eigenvalues $\lambda_i$. This corresponds to the observation that the large eigenvalues correspond to variances of the latent variables, whereas small eigenvalues are usually due to noise. For this purpose, the differences between the logarithms of the largest eigenvalue $\lambda_1$ and of $\lambda_i$ are used. In the first step, the reference slope $R_M$ of the M−1 initial, that is, the largest, eigenvalues $\lambda_i$ is computed. In the next step, each newly computed eigenvalue is compared to $R_M$. If d in (19) reaches 1, then the process of computing eigenvectors is immediately stopped. The above procedure of incremental computation is controlled by two parameters, M and η. In our experiments, the best results were obtained for M fixed to 3, while η was set in the range 2-5.
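The stopping rule can be illustrated with the following Python sketch. The concrete formulas are assumptions made for this example only: the reference slope is taken as the average drop of the log-eigenvalues over the first M values, and the computation stops at the first eigenvalue whose average log-drop from $\lambda_1$ exceeds η times that reference. The function name and the example spectrum are hypothetical.

```python
import math

def count_significant(eigenvalues, m=3, eta=3.0):
    """eigenvalues: positive values sorted in descending order.
    Returns the number of leading eigenvalues to keep."""
    logs = [math.log(v) for v in eigenvalues]
    # Assumed reference slope: mean log-drop per step over the first m values.
    r_m = (logs[0] - logs[m - 1]) / (m - 1)
    for i in range(m, len(logs)):
        slope = (logs[0] - logs[i]) / i  # mean log-drop from the largest one
        if slope > eta * r_m:            # significant drop detected
            return i                     # keep eigenvalues[0:i]
    return len(logs)

spectrum = [100.0, 80.0, 64.0, 50.0, 0.5, 0.1]
kept = count_significant(spectrum)
```

For the example spectrum the first four values decay slowly and are kept, while the sharp drop at the fifth value stops the computation.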

Experimental Results
The method was implemented entirely in C++ in the Microsoft Visual Studio 2017 IDE. For the basic tensor operations the DeRecLib library was employed [45]. The experiments were run on a computer with the Intel® i9-7960X CPU @ 2.80 GHz and 64 GB RAM, under 64-bit Windows 10.
As already mentioned, the proposed method can work with any type of signal of any finite dimensions, since no specific features are computed. However, it is not easy to find suitable test streams with ground truth annotations. Therefore, and to compare results with other works, for evaluation the VSUMM database was used. It contains 50 color videos from the Open Video Project [12,46]. The video sequences are of resolution 352 × 240 pixels, 30 fps, with duration in the range of 1 to 4 min, encoded in the MPEG-1 [4]. The total number of frames is 57,895. For each video in the VSUMM database there are annotated shots obtained by five human annotators.
For the quantitative evaluation, de Avila et al. [11,16] proposed the measures called Comparison of User Summaries. These are $\mathrm{CUS}_A = n_{AU}/n_U$ and $\mathrm{CUS}_E = \tilde{n}_{AU}/n_U$, where $n_{AU}$ denotes the number of matching keyframes from the automatic summary (AS) and the user-annotated summary (US), $\tilde{n}_{AU}$ is the size of the complement of this set (i.e., the frames that were not matched), while $n_U$ is the total number of keyframes in the US. However, in other works the precision $P$ and recall $R$ are preferred, since they also convey information on the keyframes present in the AS and not present in the US, or vice versa, as discussed by Mahmoud et al. [28,46]. As a tradeoff between the two, many works also use the $F$ measure [28,47]. This approach has also been adopted in our experiments. The above quantities are defined as follows:

$$P = \frac{n_{AU}}{n_A}, \qquad R = \frac{n_{AU}}{n_U}, \qquad F = \frac{2PR}{P + R},$$

where $n_{AU}$ is the number of keyframes from the AS that match those from the US, $n_A$ is the total number of keyframes in the AS, and $n_U$ is the total number of keyframes in the US. Figure 6a shows exemplary user-selected keyframes from the test sequence "The Voyage of the Lee, segment 05", from the Open Video Database [4]. These are compared with the thumbnails computed by our algorithm, as shown in Figure 6b. Based on such comparisons, the average value of the accuracy parameter $F$ was computed, as shown in Table 1. Detailed values for each test sequence are provided in Table 2. Figure 7 shows plots of the detected thumbnail tensors from the test sequences No. 28 (also shown in Figure 6), 38, 48, and 58, respectively.

As shown in Table 1, the proposed method outperforms the methods reported by other researchers, except for VSCAN, with which it performs equally well. Interestingly enough, our method outperforms the other methods even if run on a monochromatic version of the input videos, as shown in the second row of Table 1. In this case its performance is only slightly worse than when operating with the full-dimensional signal. This is an interesting feature of the tensor-based methods: they easily scale to various dimensions of the input signal. The other methods, being tuned to specific features such as color or texture, do not possess this feature and cannot run without the color information. At the same time, the proposed thumbnail tensor method is an order of magnitude faster than the other tensor methods and the referenced methods for which timings were available. Detailed computational times of our method are presented in Table 3. An important aspect of the group of tensor-based methods is that they make no specific assumptions about the type of the input signal. In other words, neither specific statistical properties nor specific features are required. However, investigation of the method's performance with other types of multi-dimensional signals is left for further research.
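The three accuracy measures used in this evaluation follow directly from the keyframe counts. The short Python sketch below illustrates the computation; the function name `summary_scores` and the example counts are hypothetical and serve only as an illustration, and the matching step that yields the number of matched keyframes is assumed to have been done beforehand.

```python
def summary_scores(n_au, n_a, n_u):
    """Return (precision, recall, F) for a single video.

    n_au: keyframes of the automatic summary (AS) matched in the user summary (US)
    n_a:  total number of keyframes in the AS
    n_u:  total number of keyframes in the US
    """
    precision = n_au / n_a if n_a else 0.0
    recall = n_au / n_u if n_u else 0.0
    if precision + recall == 0.0:
        # No matches at all: the harmonic mean is defined as zero here.
        return precision, recall, 0.0
    f_measure = 2.0 * precision * recall / (precision + recall)
    return precision, recall, f_measure


# Hypothetical counts, not taken from the experiments reported here.
p, r, f = summary_scores(n_au=6, n_a=8, n_u=10)
print(f"P={p:.2f} R={r:.2f} F={f:.3f}")  # P=0.75 R=0.60 F=0.667
```

Since five user summaries exist per video, such scores would typically be computed per user summary and then averaged; the exact averaging scheme is not restated here.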
As alluded to previously, two methods were tested for size reduction of the input tensor. The first of them is based on tensor randomization. However, the second one, which is based on construction of the scale-space and operation at the coarse scale, performed better for the color and monochrome video signals. Depending on the video sequence, the difference was on the order of 1-4%. This is caused by the high correlation of neighboring pixels in the video signals. Nevertheless, the randomization methods also have high potential, especially when working with other types of signals. Randomized methods are gaining popularity due to recent achievements in matrix and tensor approximations and big data [5,38]. Investigation of tensor randomization in our framework is left for further research, especially when other types of signals become available. Table 4 contains values of the important parameters that control operation of the proposed thumbnail tensor method.

Table 3. Average execution time of the tensor-based methods for the color test video sequences.

| Method | Best rank-R [13] | HOSVD [7] | HOSVD (This Paper) |
|---|---|---|---|
| Processing speed (frames/s) | 3 | 15 | 160 |
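To illustrate the size reduction by operating at a coarse scale, the following Python sketch smooths each frame with a separable binomial kernel and keeps every second pixel, repeating this for a chosen number of levels. The kernel, the subsampling factor, the number of levels, and the names `smooth_axis` and `coarse_scale` are assumptions made for this illustration and are not the settings of the proposed method.

```python
import numpy as np

# Assumed 5-tap binomial kernel (a common Gaussian approximation); sums to 1.
KERNEL = np.array([1.0, 4.0, 6.0, 4.0, 1.0]) / 16.0


def smooth_axis(x, axis):
    """Convolve `x` with KERNEL along one axis, replicating the border values."""
    pad = [(0, 0)] * x.ndim
    pad[axis] = (2, 2)
    xp = np.pad(x, pad, mode="edge")
    n = x.shape[axis]
    out = np.zeros_like(x, dtype=float)
    for k, w in enumerate(KERNEL):
        # Shifted view of the padded tensor, weighted by the k-th kernel tap.
        out += w * np.take(xp, np.arange(k, k + n), axis=axis)
    return out


def coarse_scale(frames, levels=2):
    """Reduce the spatial size of a (time, height, width, channels) tensor."""
    x = frames.astype(float)
    for _ in range(levels):
        x = smooth_axis(smooth_axis(x, 1), 2)  # smooth along height, then width
        x = x[:, ::2, ::2, :]                  # keep every second pixel
    return x


# Synthetic stand-in for a short color clip of 352 x 240 frames.
video = np.random.default_rng(0).random((8, 240, 352, 3))
thumb = coarse_scale(video, levels=2)
print(thumb.shape)  # (8, 60, 88, 3)
```

With two levels, each 352 × 240 frame shrinks to 88 × 60, so the tensor holds 16 times fewer elements before the subspace model is built.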
The proposed method was finally applied to another problem: object search in underwater video signals. In this experiment a number of human-made objects, such as a shoe, a trowel, a wheel, etc., were sunk at random places in an artificial lake (Zakrzówek, Poland). Footage was then acquired that contains these objects as well as thousands of natural objects, such as stones, plants, underwater hills, etc. There is a total of 10,845 frames in the video. From these, subjectively representative images with various objects were chosen. After that, the method was run with the same parameters as in the previous experiments. The user-selected representative frames, as well as the results of our method, are presented in Figure 8a,b, respectively.
It can be noticed that in the majority of cases our algorithm also detected the human-made objects that stand out from the background. Its overall accuracy in this test was 0.71, with an execution speed of 140 frames/s. Figure 9 shows a plot of the scene shots detected by our method in the underwater video sequence Zakrzowek No. 3. In all of the presented experiments, accuracy has to be seen in the context of the highly subjective choice of key frames by users. Our algorithm does not check whether a similar scene was already observed, say a few shots earlier, so, for example, the trowel is detected twice, in shots well separated in time, that is, frame 6663 vs. 9333, as shown in Figure 8b. Interestingly enough, almost all changes caused by the appearance of artificial objects were detected. On the other hand, there were not many false alarms caused by changes of natural objects.
This means that the method reacts well to abrupt changes caused by an 'unusual' object, which frequently is not very large but differs significantly from the background. Such a feature is characteristic of human observers: it is much easier to detect something 'unusual' in the context, such as a shoe or a trowel in underwater scenery.

Conclusions
In this paper an improved version of our previous work on the clustering of multi-dimensional signal streams is presented. Stream clustering is done by searching for abrupt signal changes, based on the proposed tensor framework and the statistical inference rules. In the tensor framework, a multi-dimensional signal is modeled with the orthogonal tensor subspaces, computed with the HOSVD tensor decomposition. However, the main novelty presented in this paper is tensor size reduction by means of tensor randomization and scale-space filtering. Because of this, an order of magnitude speed-up has been achieved compared with the previous version of this method. Due to the much smaller input tensors, the output comes in the form of the so-called thumbnail tensors. Other contributions of this paper are two efficient algorithms for tensor model construction and for its update, respectively. These are based on fast eigenvalue computation with a mechanism for an automatic choice of the important eigen-components. The method was tested with the Open Video Database, which contains shot annotations. In terms of the achieved accuracy, the proposed method outperforms other methods presented in the literature, with only one performing equally well. The method was also tested on a real underwater sequence, in the task of scene change detection to facilitate detection of sunken human-made objects. Despite the different physical properties of light propagation in water, our method performs equally well, achieving an accuracy on the order of 71% with a processing speed of 140 frames/s in a software implementation.
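For readers who wish to experiment with the tensor model, a minimal HOSVD can be sketched in Python with NumPy as below. It uses a plain SVD of each mode flattening rather than the fast eigenvalue computation with an automatic choice of components described in this paper, so it should be read as a textbook illustration and not as the proposed algorithms; the function names are hypothetical.

```python
import numpy as np


def unfold(t, mode):
    """Flatten tensor `t` into a matrix whose rows are indexed by `mode`."""
    return np.moveaxis(t, mode, 0).reshape(t.shape[mode], -1)


def mode_product(t, m, mode):
    """Multiply tensor `t` by matrix `m` along the given mode."""
    return np.moveaxis(np.tensordot(m, t, axes=(1, mode)), 0, mode)


def hosvd(t):
    """Return the core tensor and one orthogonal factor matrix per mode."""
    factors = []
    for mode in range(t.ndim):
        # Left singular vectors of the mode flattening span the mode subspace.
        u, _, _ = np.linalg.svd(unfold(t, mode), full_matrices=False)
        factors.append(u)
    core = t
    for mode, u in enumerate(factors):
        core = mode_product(core, u.T, mode)
    return core, factors


def reconstruct(core, factors):
    """Invert the decomposition by applying each factor along its mode."""
    t = core
    for mode, u in enumerate(factors):
        t = mode_product(t, u, mode)
    return t


# Small synthetic tensor used only to check the decomposition.
rng = np.random.default_rng(0)
t = rng.random((6, 5, 4))
core, factors = hosvd(t)
print(np.allclose(reconstruct(core, factors), t))  # True
```

Keeping only the leading columns of each factor matrix would give a truncated subspace model; how many columns to keep is exactly what the automatic choice of eigen-components mentioned above decides, and it is not reproduced in this sketch.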
Although tested on color video streams, the proposed algorithms are general in the sense that there are not any specific assumptions on dimensionality or type of the input signal. Therefore, the proposed framework can be applied to various types of multidimensional signals [48,49]. Also, the suggested construction of the orthogonal tensor subspaces can be used in many other classification and clustering frameworks which require the comparison of tensors.
Further research will be focused on the connection of the signal-clustering method, proposed in this paper, with an efficient data-compression method based on tensor decompositions, as already outlined in one of our works [10]. The main idea is based on an observation that, due to the high inter coherence of the signal chunks, it is possible to obtain high compression ratio with no significant loss of precision in signal reconstruction. Also, the method needs to be tested with other types of multi-dimensional data streams. However, it is difficult to find annotated databases of multidimensional signals. Therefore, further research will be conducted towards the creation of databases containing annotated synthetic and real examples.