Next Article in Journal
A Multi-Site Anti-Interference Neural Network for ASD Classification
Previous Article in Journal
Entanglement Distillation Optimization Using Fuzzy Relations for Quantum State Tomography
 
 
Font Type:
Arial Georgia Verdana
Font Size:
Aa Aa Aa
Line Spacing:
Column Width:
Background:
Article

The Extraction of Maximal-Sum Principal Submatrix and Its Applications

College of Science, Beijing Forestry University, Beijing 100083, China
*
Authors to whom correspondence should be addressed.
Algorithms 2023, 16(7), 314; https://doi.org/10.3390/a16070314
Submission received: 26 May 2023 / Revised: 23 June 2023 / Accepted: 24 June 2023 / Published: 26 June 2023
(This article belongs to the Section Combinatorial Optimization, Graph, and Network Algorithms)

Abstract

:
Extracting k -order maximal-sum principal submatrix from an n -order real matrix is a typical combinatorial optimization problem and an NP-hard problem. To improve the computational efficiency of solving this problem, we, in this paper, propose an accelerated algorithm with row-by-row updates, called the fusion row update accelerated algorithm, which works by reducing the number of addition operations for submatrix elements. The new algorithm is applied to accelerate color combination selection and maximize color difference, which improves the readability of data visualization results; it is also applied to accelerate stock investment portfolio selection and minimize correlation degree, which decreases the investment risk in the view of daily return volatility.

1. Introduction

The theory, application, and algorithm on maximal-sum submatrix (MSS) have received a lot of attention recently [1,2]. The MSS problem, also known as the maximum subarray problem [3,4], is to find the submatrix that maximizes the sum of matrix elements in it. The solutions to many application problems, such as solving the rectilinear picture compression [5] and finding a subset of genes that is relatively highly expressed among a subset of patients [1], can be converted to finding the maximal-sum submatrix from a matrix. The rows and columns in a matrix similar to the above two examples represent two different variables with different meanings.
However, there is also a special kind of real matrix from practice applications, whose rows and columns represent the same variable, such as correlation coefficient matrix, covariance matrix, and adjacency matrix. One of these specific cases is selecting the most informative principal submatrix of a pre-specified size from a covariance matrix [6] or the maximum edge-weighted subgraph from a weighted undirected graph [7]. It is necessary to choose the same k rows and k columns for the maximum sum submatrix to have practical significance. Therefore, finding the k-order MSS is transformed into finding the k-order maximal-sum principal submatrix (MSPS).
Although the issue of finding the MSPS is much closer to the MSS problem, there are still many differences between the two issues, including:
(1)
The integer programming models describing the two problems are different;
(2)
The formats of their solutions to the problems are different;
(3)
The purposes of the application problems are different.
Therefore, the algorithm for solving the MSPS issue differs from that for solving the MSS problem.
In fact, extracting the MSPS from a real matrix is a typical combinatorial optimization problem and an NP-hard problem [1]. The challenge lies in the high computational complexity of finding an accurate solution. The total number of searches for the k -order MSPS from an n -order real matrix A is C n k = n k ! / ( k ! ) . When n and k are both relatively large, the computational cost is very large.
To improve the computational efficiency, there are two ways that can be taken into account. One is to reduce the number of comparisons of two submatrices; the other is to reduce the number of addition operations for submatrix elements. We, in this paper, construct a fast algorithm to extract the maximal-sum principal submatrix from a real matrix, called the fusion row update accelerated algorithm, and apply this algorithm to the color combination selection problem to maximize color difference and the stock investment portfolio selection to minimize correlation degree, which decreases the investment risk in the view of the volatility of daily returns.
The remainder of the paper is organized as follows. Section 2 briefly surveys the related works. Section 3 describes the problem and derives related properties. Section 4 develops the algorithm for finding the MSPS. Section 5 conducts several numerical experiments to demonstrate the efficiency and the solution quality of our proposed algorithm. Section 6 presents two applications. Section 7 concludes the paper and lists the limitations of the proposed algorithm.

2. Related Work

We review the relevant literature on three aspects: best submatrix extraction, color combination selection and stock investment portfolio construction.

2.1. Best Submatrix Extraction

There are many applications related to extracting the best k × k submatrix from a matrix A. Due to different purposes of application, the definition of the term “best” also varies. The best submatrix means that it has the maximum volume [8], the largest logarithm of the determinant or the minimal trace of the inverse of the selected principal submatrix [6], the max-algebraic permanent [9], etc.
In many cases, the best submatrix means to find the submatrix that has the maximum sum. The maximal-sum submatrix (MSS) can be a principal submatrix or not a principal submatrix [10,11]. The extracted submatrix can be a rectangle [5,12], which means that both the extracted rows and columns are continuous. The max-sum submatrix can also be defined as a rectangular submatrix, in which both the rows and the columns are non-necessarily contiguous [1,2].
The problems discussed in the existing literature mentioned above are both related and different from the problem we want to solve in this paper, but none of them is completely consistent with the problem we are solving. Similarities and differences of these problems between the mentioned literature and ours are listed in Table 1. Compared to the best k × k submatrix extraction, the submatrix to be extracted is a square matrix, and the calculated submatrix features are different. The characteristics of existing methods for calculation are the maximum volume [8], the largest logarithm of the determinant and the minimal trace of the inverse of the selected principal submatrix, but we focus on the sum of all elements of the submatrix.
Compared to the MSS extraction, the commonality of all methods is to find the optimal solution based on the sum of all elements of the submatrix, but most methods do not specify the size of the submatrix, and we, in this paper, focus on the k-order principal submatrix.

2.2. Color Combination Selection

There are many applications related to the k-order maximal-sum principal submatrix (MSPS). One of them is color combination selection. Constructing an optimal color combination is a critical factor for data visualization [13,14], and a lot of literature paid their attention to it [15,16,17,18,19]. A method to improve the readability of the data after clustering and visualizing is to consider the degree of color difference, which can be measured by the distance of two colors [20].
Our MSPS algorithm can be applied to automatically construct the optimal color combination to improve the readability of visualized data. The optimal color combination here is defined by the largest color difference among all selected colors. The issue that extracts k colors from n different colors to form a color combination with maximum color difference, corresponds to the problem that extracts the k -order maximal-sum principal submatrix from an n-order real symmetric matrix.

2.3. Stock Investment Portfolio Construction

Another application related to the k-order MSPS is the stock portfolio selection. Considering that the rise and fall of stocks are related to uncertain factors such as market behavior, company operations and national policies [21], diversified investment is often adopted to decrease the daily volatility [22]. The daily volatility as the investment risk can be measured by variance or standard deviation of the return; therefore, portfolio selection is proposed in order to maximize the expected return and minimize the risk [23,24]. For this goal, the stock portfolio can be constructed according to the minimum covariance matrix [21] and minimum correlation coefficient matrix of stock returns [25].
Most existing methods described above focus on the investment ratio of each stock in the selected k stocks as a portfolio. We, in this paper, pay attention to the construction of a portfolio that consists of k stocks from n specified stocks, which can be used as a preliminary step for investment ratio assignment issues.

3. Problem Description and Related Property

Different from those best principal submatrix selection problems reviewed in previous sections, our k-order maximal-sum principal submatrix (MSPS) is defined by the maximal sum of all elements in the principal submatrix. The relevant concepts and property related to the proposed MSPS extraction algorithm are summarized as follows.

3.1. Problem Description

The notations in this paper are defined as follows. Vectors are represented as bold lowercase characters x , y , etc. Matrices are represented as uppercase characters A , B , etc. Elements in vectors and matrices, as well as scalars in general, are represented as normal italic characters, x i , m i j , i , etc.
Let an n-order real matrix A = a i j n × n i , j L , where L = { 1 , 2 , 3 , , n } is a sequence number set and · is the cardinal number of a set.
Definition 1. 
Let  I L , I = k ; the submatrix  A I k  consisting of the cross elements of  I  rows and  I  columns contained in the real matrix  A  is called the  k -order principal submatrix of matrix   A .
Definition 2. 
Among all  k -order principal submatrices of   A , the maximal-sum principal submatrix (MSPS)  A I * k  maximizes the sum of the elements.
The problem of finding the MSPS of matrix A can be formulized as an optimization problem:
A I * k = a r g max I L A I k S
where the sum of matrix elements is denoted by · S , i.e.,
A S = i , j L a i j
Let I * be the optimal index set and A I * k be the optimal solution. The optimal value is S * as
S * = A I * k S = max I L A I k S = i , j I * a i j
It is easy to know that matrix A and its transposed matrix A T have the same MSPS and the optimal index set, as well as the optimal solution. If a positive integer k is specified, both the optimal index set and the optimal solution perhaps are not unique, but the optimal value is unique.
Example 1. 
Let A  be a real matrix,  A = 2 1 3 0 1 0 4 2 3 4 2 2 1 2 2 1 . When  k = 2 , there are two optimal solutions  A 1 , 3 2 = 2 3 3 2  and  A 2 , 3 2 = 0 4 4 2 , the corresponding optimal index sets are respectively  I * = 1 , 3  and  I * = 2 , 3 , and the optimal value is  S * = 10 .
Example 2. 
Matrix  A  is a non-negative symmetric matrix of  m -order,
A = m m 1 m 2 1 m 1 m 1 m 2 1 m 2 m 2 m 2 1 1 1 1 1 .
For the matrix, if  k  is specified, the MSPS of k-order is  
A I * k = m m 1 m k + 1 m 1 m 1 m k + 1 m k + 1 m k + 1 m k + 1 m k + 1 .
The optimal index set   I * = 1,2 , , k , and the optimal value is
S * = k 2 m i = 1 k k i 2 k i + 1 = k 2 m k ( k 1 ) ( 4 k + 1 ) 6 .
Remark. 
The matrix  A  in this example is related to the optimal layout of the stacked graph for visualizing multi-dimensional time series data [13].

3.2. MSPS Problem Optimization Model

The MSPS problem of a real matrix A can be expressed as a 0–1 integer programming problem as follows:
max S = x T A x s . t . x 1 + x 2 + + x n = k x i { 0,1 } ,
where x = x 1 , x 2 , , x n , x i { 0,1 } . The components with an element value of 1 correspond to the optimal indicator set I * . The objective function value S is the sum of the elements of the MSPS, i.e., A I * k S .
If n and k are small, the branch and bound method can be employed to solve the integer programming problem (4). However, in the era of big data, n and k in practice are often very large. Therefore, it is necessary to design some efficient and accurate algorithms to solve this issue.

3.3. Property

We observed that when traversing each principal submatrix, two adjacent principal submatrices only differ by one row and one column. Using this feature, a formula for accelerating the calculation of the sum of matrix elements can be derived.
Property 1. 
Let two k order matrices
A i 1 , i 2 , , i k = a i 1 , i 1 a i 1 , i k 1 a i 1 , i k a i k 1 , i 1 a i k 1 , i k 1 a i k 1 , i k a i k , i 1 a i k , i k 1 a i k , i k ,
and
A i 1 , i 2 , , i k 1 , i k + 1 = a i 1 , i 1 a i 1 , i k 1 a i 1 , i k + 1 a i k 1 , i 1 a i k 1 , i k 1 a i k 1 , i k + 1 a i k + 1 , i 1 a i k + 1 , i k 1 a i k + 1 , i k + 1
have the same k − 1 order principal submatrix in the upper left corner, then the sum of the elements of two matrices has the following relationship,
S n = S o + Δ r u .
In Equation (5),  S n = A i 1 , i 2 , , i k 1 , i k + 1 S  is the sum of elements in the next  k -order matrix;  S o = A i 1 , i 2 , , i k S  is the sum of elements in the previous k-order matrix;
Δ r u = α i k , · α · , i k , · + a i k , i k + α i k + 1 , · + α · , i k + 1 a i k + 1 , i k + 1 ,
where  α i k , · = j = 1 k a i k , i j  is the sum of all elements in the row  i k  in  A i 1 , i 2 , , i k  and  α · , i k = j = 1 k a i j , i k  is the sum of all elements in the column  i k  in  A i 1 , i 2 , , i k .
It can be seen that both A i 1 , i 2 , , i k 1 , i k and A i 1 , i 2 , , i k 1 , i k + 1 have the same (k − 1)-order submatrix A i 1 , i 2 , , i k 1 . Then, applying A i 1 , i 2 , , i k 1 S to the calculation of the sum of elements in the next matrix A i 1 , i 2 , , i k 1 , i k + 1 , we can avoid repeated addition operations and achieve the goal of improving efficiency.
On the right side of Equation (5), the crossing element a i k , i k appears simultaneously in row i k and column i k , so it must be added back once. Due to the same reason, the crossing element a i k + 1 , i k + 1 has been added twice, then it must be subtracted to obtain the final result S n .
Obviously, summing all elements one by one to calculate S n requires k 2 1 addition operations. However, if using the proposed Formula (5) to calculate S n , the number of times to add or subtract is 4 k 4 .
When k 4 ,
k 2 1 > 4 k 4 .
The inequality (7) means that when k is greater than 4, using Formula (5) to extract A I * k S can reduce the number of additions.

4. The Proposed Algorithm

Integrating Property 1 as a key step, a new method to extract the maximal-sum principal submatrix using row update algorithm (MSPS-RU) is proposed. The algorithm flowchart is shown in Figure 1.
In Figure 1, the input is an n-order real matrix and the order k of the maximal-sum principal submatrix we want to extract. According to the increasing order of columns in I = { i 1 , i 2 , , i k } , all principal submatrices are traversed. Therefore, the condition for the algorithm to terminate is that all principal submatrices have been compared.
If the two matrices to be compared only have one row and one column different, Formula (5) is employed to accelerate the calculation of the sum of matrix elements. Otherwise, all elements should be added one by one using the naive enumeration method. The output is the optimal solution. In the entire algorithm, the reduction in addition is approximately k n k 2 / 2 .
The pseudocode of the algorithm (Algorithm 1) is described as follows.
Algorithm 1: MSPS-RU algorithm
Input: A; //an n -order real matrix A
k ; //the order of the MSPS to be extracted
Output:  I * ; // the optimal index set of the MSPS
1 Let I o = i 1 , i 2 , , i k = { 1,2 , , k } ;
2 build initial principal submatrix A o i 1 , i 2 , , i k , I * = { i 1 , i 2 , , i k } ;
3 calculate the sum S o of all elements in A o , S * = S o ;
4 while ( i 1 n k + 1 )
5   i k = i k + 1 ; I n = i 1 , i 2 , , i k 1 , i k ;
6  build next principal submatrix A n i 1 , i 2 , , i k ;
7  if I n 1 . . k 1 = I o 1 . . k 1
8    S n = S o + Δ r u ; // Formula (5)
9  else
10    S n = A n S = i , j I n a i j ;// Formula (2)
11  endif
12  if S n > S *
13    S * = S n ; I * = { i 1 , i 2 , , i k } ;
14  endif
15   A o = A n ,   I o = I n ;
16 end while
17 return  I * , S * ;

5. Experiment

To evaluate performance of the proposed MSPS-RU algorithm, a naive enumeration algorithm (denoted as Enum) is used as the baseline. The difference between Enum and the MSPS-RU algorithm is that Enum does not have an accelerated summation step, and its pseudocode (Algorithm 2) has been described as follows.
Algorithm 2: Enum algorithm
Input: A; //an n -order real matrix A
k ; //the order of the MSPS to be extracted
Output:  I * ; // the optimal index set of the MSPS
1 Let I o = i 1 , i 2 , , i k = { 1,2 , , k } ;
2 build initial principal submatrix A o i 1 , i 2 , , i k , I * = { i 1 , i 2 , , i k } ;
3 calculate the sum S o of all elements in A o , S * = S o ;
4 while ( i 1 n k + 1 )
5   i k = i k + 1 ; I n = i 1 , i 2 , , i k 1 , i k ;
6  build next principal submatrix A n i 1 , i 2 , , i k ;
7   S n = A n S = i , j I n a i j ;// Formula (2)
8  if S n > S *
9    S * = S n ; I * = { i 1 , i 2 , , i k } ;
10   endif
11    A o = A n ,   I o = I n ;
12 end while
13 return  I * , S * ;
Formula (8) gives the change in time complexity, that is, the number of addition operations is reduced from O k 2 of the Enum method to O k of the proposed MSPS-RU method in each sum of submatrix elements.
We use several experiments to discuss the efficiency of our MSPS-RU algorithm and compare it with the Enum method. The degree of improvement in efficiency adopts a relative index η calculated by
η = T E n u m T o u r s T o u r s × 100 % ,
where T o u r s and T E n u m are the average time taken to solve the issue using the MSPS-RU algorithm and the Enum method, respectively.
All algorithms have been implemented in the C++ programming language. Each run is executed with a single thread on a personal laptop, and the computer configuration is Intel (R) Core (TM) i5-6200U CPU @ 2.30 GHz 2.40 GHz processor and 4GB RAM per run. The runtime under each parameter is the average experiment time (unit: seconds) of 20 repeated experiments. Without loss of generality, considering the computer hardware conditions and time cost, the highest order of the real matrix is no more than 60 in the following experiments.

5.1. Parameter k Is Fixed and Parameter n Is Variable

To explore the effect of parameter n on the calculation time of the MSPS-RU algorithm, the following comparative experiments were conducted.
The n -order real matrix is randomly generated using uniform distribution, and n varies from 20 to 60 with constant step size 10. Let k = 5 and 10, which means extracting the 5-order and 10-order MSPS of the n -order real matrix. The runtime (unit: second) of the Enum algorithm and the MSPS-RU algorithm are compared, and the average runtime of the two algorithms is recorded, as shown in Table 2. From the table, the average runtime of the MSPS-RU method is less than that of the Enum method, and the relative improvement degree η achieves over 20%.
Figure 2 illustrates the comparison of runtime growth trends of extracting 5-order and 10-order MSPS using different algorithms. It also shows that with the increase in n, the time growth rate of our algorithm is lower than that of the Enum algorithm according to the trend lines. Therefore, it indicates that MSPS-RU performs well in solving the MSPS issue of larger-order matrices.

5.2. Parameter n Is Fixed and Parameter k Is Variable

To explore the impact of parameter k on the time efficiency of the MSPS-RU algorithm, a 30-order ( n = 30 ) real matrix is generated randomly according to the uniform distribution, and the k ( k n ) order MSPS is extracted using the Enum method and MSPS-RU algorithm, respectively. The runtime of the two algorithms is listed in Table 3.
From Table 3, we can find that the time consumption of both algorithms first increases and then decreases with the increasement of k. The maximum value of the time consumption is reached when k is 15 or so. The reason, perhaps, is that the number of k -order principal submatrices in a 30-order real matrix is C 30 k , and when k = 15 ,   C 30 15 = max k < n C n k . In addition, the time efficiency of the two methods is not much different when the value k is relatively close to n; resulting from that, the number of running code Line 8 in Algorithm 1 is small.
Table 3 also tells us that our method is significantly more time-efficient than the Enum method when k 20 . We conducted a significance hypothesis test. Assuming: H 0 : μ 1 = μ 2 ,   H 1 : μ 1 μ 2 , ( μ 1 , μ 2 ) is the average time consumption of the two methods. The statistics Z = 2.014 > 1.96 = Z 0.05 , p < 0.05 , then H 1 is accepted at a significance level α = 0.05.
Figure 3 illustrates the comparison of runtime growth trends of extracting k-order MSPS from a 30-order matrix using different algorithms. The trend lines are generated using the moving average method with 2 period. It also shows that with the increase in k, the time growth rate of our algorithm is lower than that of the Enum algorithm according to the trend lines. Therefore, it indicates that MSPS-RU performs better than that of the Enum algorithm.

5.3. Parameters n and k Are Linear Function

In this subsection, we test the time efficiency of different algorithms when both n and k are variables. For convenience, let n = 3 k in this experiment. When n = 9 , 15 , 21 , 27 , 33 , 39 , the corresponding k = 3 , 5 , 7 , 9 , 11 , 13 . Both n and k are arithmetic series. The common difference is 6 and 2, respectively. Using the Enum method and MSPS-RU algorithm to extract the k -order MSPS, the average runtime is listed in Table 4. From the table, we can find that the average runtime of the Enum method is more than that of the MSPS-RU algorithm; when n = 39 , the runtime of the Enum method is nearly 2 times that of the MSPS-RU algorithm, which means that the computational efficiency of the MSPS-RU algorithm is significantly improved compared to the Enum method when both n and k are large numbers.
Figure 4 illustrates the comparison of runtime growth trends of extracting k-order MSPS from a 3k-order matrix using different algorithms. The trend lines are generated using the moving average method with 2 period. From the MA lines, it can be found that as k and n increase, the growth trend of our method’s moving average curve is significantly slower than that of the Enum method, and the efficiency improvement also shows an increasing trend.
In summary, the above three algorithmic efficiency experiments show that the proposed MSPS-RU algorithm works well and has significantly improved efficiency compared to the Enum method.

6. Applications

The extracted MSPS has a lot of applications. Our MSPS algorithm can be used to improve time efficiency of extracting MSPS. This section introduces two applications: (1) automatically constructing an optimal color combination to improve the readability of visualized data; (2) automatically constructing an optimal stock portfolio selection to minimize correlation degree and reduce the investment risk in the view of daily return volatility.

6.1. Construct a Optimal Color Combination

The optimal color combination here is defined by the largest color difference among all selected colors. The issue of extracting k colors from n different colors to form a color combination with maximum color difference corresponds to the problem of extracting a k -order maximal-sum principal submatrix from an n-order real matrix.
The first step is to build a distance matrix of which the element is the distance between two colors. Then, based on this distance matrix, the k -order MSPS is extracted to determine the combination of k colors with maximum difference, resulting in the enhancement of the readability of data visualization naturally.

6.1.1. Construction of Color Distance Matrix

The most commonly used color space in computer systems is RGB color space, whose pixel values are generally obtained through hardware devices such as image acquisition cards and CCD cameras. Throughout the entire color space Ω , all colors can be represented by three channels:
Ω = r , g , b | r , g , b ϵ 0,1 , , 255
Traditionally, red can be represented as ( 255,0 , 0 ) , green can be represented as ( 0,255,0 ) , and blue can be represented as ( 0,0 , 255 ) .
Definition of standard color difference is based on RGB color space. In Ω , let any two colors x i = r i , g i , b i and x j = r j , g j , b j ; the spatial distance d x i , x j between colors x i and x j is defined by the Euclidean distance.
d x i , x j = r i r j 2 + g i g j 2 + b i b j 2
Obviously, d x i , x j = d x j , x i , and we abbreviate d x i , x j as d i j .
Define the distance matrix D between any two colors. Suppose Ω = n , which means that there are n colors used as color candidates. Let i , j L = 1,2 , , n , and the distance matrix between different colors can be constructed as:
D = 0 d 12 d 1 n d 21 0 d 2 n d n 1 d n 2 0
Matrix D is an n -order real symmetric non-negative matrix.
The MSPS-RU algorithm is applied to the n -order distance matrix D to extract the optimal solution A I * ( k ) and the optimal indicator set I * . I * corresponds to the k selected colors. By this way, the selected colors have the maximal color distance d i j . Therefore, these k selected colors can form a color combination with maximal difference and readability.

6.1.2. Application on Stacked Graph Coloring

The experimental data are the precipitation information (unit: mm) of each month in the year (1955) provided by ten urban meteorological stations, namely Huma, Hailar, Nenjiang, Xiguitu Banner Boketu, Keshan, Qiqihar, Hailun, Fujin, Horqin Right Front Banner, and Anda. These are time series data (Figure 5) that can be downloaded from the Institute of Geographic Sciences and Resources, Chinese Academy of Sciences [26].
The 10-line strips with different colors in Figure 5 represent monthly rainfall of 10 cities in a year. In order to observe the accumulated rainfall of all cities and the trend of rainfall in each city more conveniently, the monthly rainfall of 10 cities can be visualized using a stacked graph (also called streamgraphs or ThemeRiver), as shown in Figure 6. In the stacked graph, the color of each strip is a key factor related to the readability of the figure.
Figure 6 shows the stacked graph with the color combination built by 10 colors with maximal difference. These colors are selected from standard RGB color space Ω based on the 10-order MSPS obtained by the MSPS-RU algorithm.
Besides RGB color space Ω , there are many other color spaces, for example, ΔE* color space. In this dissertation [20], the distance between colors has been expounded, and the distance matrix is built using the distances among 18 candidate colors. We extracted 10 colors using our MSPS-RU algorithm from 17 candidate colors, where the white color is removed because it is used for the background color, to visualize the monthly rainfall of 10 cities. The colored stacked graph can be seen in Figure 7.
As a comparison, a color combination based on the Rcolorbrewer function package and user study [13] is used to color the stacked graph, as shown in Figure 8.
Comparing Figure 8 to Figure 6 and Figure 7, the former shows that the color combination built by the rule from reference [13] contains certain aesthetic preferences, and the co-occurrence of colors is relatively suitable and visually comfortable. However, the stacked graphs in Figure 6 and Figure 7, colored using our method, display the best contrast of coloring among the three figures and avoid similar colors appearing in the stacked graph. This observation can also be supported by quantitative data. The average value of the sum of elements of all k-order principal submatrices is 11.2273, but the corresponding values in Figure 6, Figure 7 and Figure 8 are 56.6027, 43.7446 and 38.7163, respectively. These figures indicate that using our method or the rule from [13] can obtain better results, and our method is even more advantageous. Therefore, from the perspective of increasing contrast, our method provides an alternative solution.

6.2. Construct a Optimal Stock Portfolio

Another application related to the k-order MSPS is the stock portfolio selection. An optimal stock portfolio is here defined as the minimum correlation degree of stock returns, which is regarded as a kind of investment risk [25]. The minimum correlation degree means increased diversification of investment items. Our goal is to select a portfolio with the minimum correlation degree from the candidate stocks. The algorithm, named find a portfolio with the minimum correlation degree (FP-MCD), mainly includes the following steps to achieve this goal:
(1)
Building a negative correlation coefficient matrix: Since MSPS-RU deals with the maximal problem, the minimum correlation degree can be converted to the maximal problem of the negative correlation coefficient matrix.
(2)
Using MSPS-RU algorithm to extract k stocks which have the maximal negative correlation degree.
(3)
Proposing investment suggestions.
The pseudocode of the algorithm (Algorithm 3), FP-MCD, has been described as follows.
Algorithm 3: FP-MCD algorithm
Input:  R t × n ; //The return rates of n candidate stocks in continuous t days;
     k ; //the number of stocks in the portfolio
     t 1 ; // t 1 days used for learning, constructing the correlation coefficient matrix; ( t t 1 ) days used to test portfolio returns
Output: I * ; // the optimal index set of the n candidate stocks
1 Let R t × n = P t 1 × n Q ( t t 1 ) × n ;
2 Construct the correlation coefficient matrix T n × n using P t 1 × n ;
3 Build the negative correlation coefficient matrix M n × n = T n × n ;
4 Use MSPS-RU to obtain the optimal portfolio I *
5 Test portfolio returns with I * and data Q ( t t 1 ) × n ;
6 If the I * is accepted
7  return I * ;
8 else
9  return N o t   r e c o m m e n d e d ;
10 endif

6.2.1. Data

We recorded the weekly return rates of 50 stocks in the Shanghai Stock Exchange in the past 3 years from 3 April 2020 to 24 March 2023, a total of 36 months and 153 weeks on the Sohu website (https://q.stock.sohu.com/cn/bk_4507.shtml, accessed on 4 April 2023). During this period, the return data of four companies, China Yangtze Power, Three Gorges Energy, CICC, and Trina Solar, are incomplete, so these four stocks are removed from the candidate stock set. Then, the candidate stock set consists of 46 companies, including Baotou Steel, Sinopec, CITIC Securities, Sany Heavy Industry, China Merchants Bank, Poly Real Estate, SAIC Motor, China Northern Rare Earth, Fosun Pharma, Jiangsu Hengrui Medicine, Wanhua Chemical Group, Hengli Petrochemical, Guodian NARI, Pien tze huang, Tongwei, Kweichow Moutai, Hailuo Cement, Haier Zhijia, Wingtech, Shanxi Fenjiu, Yili, Hangfa Power, Longji Green Energy CSCI, China Shenhua, Industrial Bank, Shaanxi Coal Industry, Agricultural Bank of China, Ping An Insurance, Industrial and Commercial Bank of China, China Pacific Insurance, China Life Insurance, Great Wall Motor, CSCEC, Power Construction Corporation of China, Huatai Securities, PetroChina, China Exemption, Zijin Mining, COSCO Marine Holdings, WuXi AppTec, Hesheng Silicon, Foshan Haitian Flavouring&Food Co, Weier, Huayou Cobalt, Zhaoyi Innovation. A total of 7038 sample data were obtained, as shown in Figure 9.
The correlation coefficient matrix T of the weekly increase in 46 companies is shown in Figure 10.

6.2.2. Construct the Optimal Investment Portfolio

Using the MSPS-RU algorithm, the k -order maximal-sum principal submatrices from the negative correlation coefficient matrix M can be extracted. Table 5 lists different investment portfolios corresponding to different k values.

6.2.3. Test the Optimal Investment Portfolio

The test experiment is conducted using a simple portfolio, k = 3 , without losing generality. When k = 3 , the optimal portfolio consists of the 1st stock Baogang Shares, the 14th stock Pien tze huang, and the 40th stock China Exempt, according to Table 5.
We use the first one-third of the 153 weekly return rates ( t 1 = 51 ) to calculate the correlation coefficient matrix and proceed with investment from the next working week. For convenient comparison, 30 random investment portfolios are also simulated, and each random investment portfolio consists of three stocks randomly selected from 46 stocks. Let the initial investment be 1 unit, and there is no buying or selling transactions during the period. The changes in assets during the investment period of all 30 random investment portfolios and ours are drawn as line charts (Figure 11).
In Figure 11, the red line shows that after 102 weeks, its result is greater than the initial fund 1. This means successfully investing with the optimal investment portfolio recommended by our method. After 102 weeks, the results of the green or blue lines are diversified, which reflects the instability of random investments.
Two indicators are used for quantitative analysis. The first is the overall standard deviation σ defined by Formula (9), which measures the volatility of investment returns [27].
σ = 1 n i = 1 n x i x ¯ 2
where x ¯ is the arithmetic mean, and x i is the daily increase.
Another indicator for quantitative analysis is the final asset quantity, denoted by Ret.
Using “ R n d k” to represent 30 random portfolios, “ O u r s ” to represent the portfolio obtained by the MSPS-UR algorithm, the result after 102 weeks in Figure 11 is listed in Table 6.
For a more direct comparison, we subtract the corresponding indicators, i.e.,
σ = σ R n d   k σ O u r s
R e t = R e t R n d   k R e t O u r s
The comparison of 30 random portfolios and our portfolio is shown in Table 7.
In Table 7, there are eight portfolios that satisfied R e t < 0 and σ > 0 , which means that these portfolios not only have lower final returns than those obtained by the MSPS-RU algorithm, but also have greater asset volatility. There are 15 combinations that satisfied σ < 0 and R e t < 0 , which means that although these portfolios have lower asset volatility, their returns are lower than ours too.

7. Conclusions

We, in this paper, propose a fusion row update accelerated algorithm (MSPS-RU) and present two different applications. The experimental results show that the proposed algorithm avoids repeatedly calculating the sum of the same submatrix and significantly improves computing efficiency compared to the Enum algorithm; the color combination based on MSPS-RU enhances the contrast of the data visualization results; the stock investment portfolio constructed by MSPS-RU can bring higher investment returns but lower risks, avoiding the uncertainty of random investments.
Our method has limitations, and solving these problems can become future research topics. The improvement of computational efficiency in this paper mainly depends on reducing the number of matrix elements to be summed. However, it can also be achieved by reducing the number of principal submatrices to be calculated, for example, using pruning. The selection of color combinations should consider not only the maximizing color difference but also aesthetic habits and preferences. In addition, more application problems can be modeled by extracting the maximum sum principal submatrix from a raw data matrix.

Author Contributions

Conceptualization, L.L. and H.L.; methodology, Y.Z. and H.L.; software, Y.Z. and H.L.; validation, Y.Z., L.L. and H.L.; formal analysis, H.L.; investigation, Y.Z.; resources, Y.Z.; data curation, Y.Z.; writing—original draft preparation, Y.Z., L.L. and H.L.; writing—review and editing, L.L.; visualization, Y.Z. and H.L.; supervision, L.L. and H.L.; project administration, H.L. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

Data can be download from http://www.data.ac.cn/table/tbc40, accessed on 21 May 2023.

Acknowledgments

The author thanks the anonymous referees for their valuable suggestions.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Branders, V.; Schaus, P.; Dupont, P. Combinatorial optimization algorithms to mine a sub-matrix of maximal sum. In Proceedings of the 6th International Workshop on New Frontiers in Mining Complex Patterns in Conjunction with ECML-PKDD 2017, Skopje, Macedonia, 18–22 September 2017; Lecture Notes in Computer Science. Springer International Publishing: Berlin/Heidelberg, Germany, 2018; Volume 10785, pp. 65–79. [Google Scholar] [CrossRef]
  2. Derval, G.; Schaus, P. Maximal-Sum submatrix search using a hybrid contraint programming/linear programming approach. Eur. J. Oper. Res. 2022, 297, 853–865. [Google Scholar] [CrossRef]
  3. Ferreira, C.S.; Camargo, R.Y.; Song, S.W. A Parallel Maximum Subarray Algorithm on GPUs. In Proceedings of the 2014 International Symposium on Computer Architecture and High Performance Computing Workshop, Paris, France, 22–24 October 2014; pp. 12–17. [Google Scholar] [CrossRef]
  4. Weddell, S.J.; Read, T.R.; Thaher, M.; Takaoka, T. Maximum subarray algorithms for use in astronomical imaging. J. Electron. Imaging 2013, 22, 043011. [Google Scholar] [CrossRef] [Green Version]
  5. Koch, I.; Marenco, J. The maximum 2D subarray polytope: Facet-inducing inequalities and polyhedral computations. Discret. Appl. Math. 2022, 323, 286–301. [Google Scholar] [CrossRef]
  6. Li, Y.; Xie, W. Best Principal Submatrix Selection for the Maximum Entropy Sampling Problem: Scalable Algorithms and Performance Guarantees. arXiv 2023, arXiv:2001.08537. [Google Scholar] [CrossRef]
  7. Macambira, E.M. An Application of Tabu Search Heuristic for the Maximum Edge-Weighted Subgraph Problem. Ann. Oper. Res. 2002, 117, 175–190. [Google Scholar] [CrossRef]
  8. Massei, S. Some algorithms for maximum volume and cross approximation of symmetric semidefinite matrices. BIT Numer. Math. 2022, 62, 195–220. [Google Scholar] [CrossRef]
  9. Lewis, S.C. On the Best Principal Submatrix Problem. Ph.D. Thesis, University of Birmingham, Birmingham, UK, 2006. [Google Scholar]
  10. Wen, Z. Fast parallel algorithms for the maximum sum problem. Parallel Comput. 1995, 21, 461–466. [Google Scholar] [CrossRef]
  11. Tamaki, H.; Tokuyama, T. Algorithms for the Maximum Subarray Problem Based on Matrix Multiplication. Interdiscip. Inf. Sci. 2000, 6, 99–104. [Google Scholar] [CrossRef] [Green Version]
  12. Takaoka, T. Efficient Algorithms for the Maximum Subarray Problem by Distance Matrix Multiplication. Electron. Notes Theor. Comput. Sci. 2002, 61, 191–200. [Google Scholar] [CrossRef] [Green Version]
  13. He, Y.; Li, H. Optimal layout of stacked graph for visualizing multidimensional financial timeseries data. Inf. Vis. 2022, 21, 63–73. [Google Scholar] [CrossRef]
  14. Healey, C.G. Choosing effective colours for data visualization. In Proceedings of the Seventh Annual IEEE Visualization ’96, San Francisco, CA, USA, 27 October–1 November 1996; pp. 263–270. [Google Scholar] [CrossRef] [Green Version]
  15. Zhou, L.; Hansen, C.D. A Survey of Colormaps in Visualization. IEEE Trans. Vis. Comput. Graph. 2016, 22, 2051–2069. [Google Scholar] [CrossRef] [PubMed] [Green Version]
  16. El-Assady, M.; Kehlbeck, R.; Metz, Y.; Schlegel, U.; Sevastjanova, R.; Sperrle, F.; Spinner, T. Semantic Color Mapping: A Pipeline for Assigning Meaningful Colors to Text. In Proceedings of the 2022 IEEE 4th Workshop on Visualization Guidelines in Research, Design, and Education (VisGuides), Oklahoma City, OK, USA, 17 October 2022; pp. 16–22. [Google Scholar]
  17. Anderson, C.L.; Robinson, A.C. Affective Congruence in Visualization Design: Influences on Reading Categorical Maps. IEEE Trans. Vis. Comput. Graph. 2022, 28, 2867–2878. [Google Scholar] [CrossRef] [PubMed]
  18. Samsel, F.; Bartram, L.; Bares, A. Art, Affect and Color: Creating Engaging Expressive Scientific Visualization. In Proceedings of the IEEE VISAP, Berlin, Germany, 23–26 October 2018. [Google Scholar]
  19. Wang, Y.; Chen, X.; Ge, T.; Bao, C.; Sedlmair, M.; Fu, C.-W.; Deussen, O.; Chen, B. Optimizing Color Assignment for Perception of Class Separability in Multiclass Scatterplots. IEEE Trans. Vis. Comput. Graph. 2019, 25, 820–829. [Google Scholar] [CrossRef] [PubMed]
  20. Abeyta, R.N. The Distance between Colors; Using DeltaE* to Determine Which Colors Are Compatible. Ph.D. Thesis, Embry-Riddle Aeronautical University, Daytona Beach, FL, USA, 2011. Available online: https://commons.erau.edu/edt/9 (accessed on 23 June 2023).
  21. Ledoit, O.; Wolf, M. Improved estimation of the covariance matrix of stock returns with an application to portfolio selection. J. Empir. Financ. 2003, 10, 603–621. [Google Scholar] [CrossRef] [Green Version]
  22. Onali, E.; Mascia, D.V. Corporate diversification and stock risk: Evidence from a global shock Author links open overlay panel. J. Corp. Financ. 2022, 72, 102150. [Google Scholar] [CrossRef]
  23. Markowitz, H. Portfolio selection. J. Financ. 1952, 7, 77–91. [Google Scholar]
  24. Ruppert, D.; Matteson, D.S. Portfolio Selection. In Springer Texts in Statistics; Springer: Berlin/Heidelberg, Germany, 2015; pp. 465–493. [Google Scholar] [CrossRef]
  25. Tiwari, A.K.; Abakah, E.J.A.; Karikari, N.K.; Hammoudeh, S. Time-varying dependence dynamics between international commodity prices and Australian industry stock returns: A Perspective for portfolio diversification. Energy Econ. 2022, 108, 105891. [Google Scholar] [CrossRef]
  26. Institute of Geographic Sciences and Resources, Chinese Academy of Sciences. Monthly Precipitation over the Years (by Station), Dataset. Available online: http://www.data.ac.cn/table/tbc40 (accessed on 21 May 2023).
  27. Moledina, A.A.; Roe, T.L.; Shane, M. Measuring commodity price volatility and the welfare consequences of eliminating volatility. In Proceedings of the American Agricultural Economics Association Annual Meeting, Denver, CO, USA, 1–4 August 2004. [Google Scholar]
Figure 1. Flowchart of MSPS-RU algorithm.
Figure 1. Flowchart of MSPS-RU algorithm.
Algorithms 16 00314 g001
Figure 2. The comparison of runtime growth trends of extracting 5-order and 10-order MSPS using different algorithms.
Figure 2. The comparison of runtime growth trends of extracting 5-order and 10-order MSPS using different algorithms.
Algorithms 16 00314 g002aAlgorithms 16 00314 g002b
Figure 3. Runtime trends of extracting k -order MSPS in a 30-order real matrix.
Figure 3. Runtime trends of extracting k -order MSPS in a 30-order real matrix.
Algorithms 16 00314 g003
Figure 4. Runtime trend curves of extracting k -order MSPS in a 3k-order matrix.
Figure 4. Runtime trend curves of extracting k -order MSPS in a 3k-order matrix.
Algorithms 16 00314 g004
Figure 5. Monthly rainfall of 10 cities in a year.
Figure 5. Monthly rainfall of 10 cities in a year.
Algorithms 16 00314 g005
Figure 6. Color combination with maximal difference.
Figure 6. Color combination with maximal difference.
Algorithms 16 00314 g006
Figure 7. The stacked graph colored with maximal difference 10 colors extracted from the ΔE* color distance matrix [20].
Figure 7. The stacked graph colored with maximal difference 10 colors extracted from the ΔE* color distance matrix [20].
Algorithms 16 00314 g007
Figure 8. Color combination built by the rule from reference [13].
Figure 8. Color combination built by the rule from reference [13].
Algorithms 16 00314 g008
Figure 9. Stacked chart of weekly return rates of 46 stocks.
Figure 9. Stacked chart of weekly return rates of 46 stocks.
Algorithms 16 00314 g009
Figure 10. The correlation matrix for the weekly increase in 46 stocks.
Figure 10. The correlation matrix for the weekly increase in 46 stocks.
Algorithms 16 00314 g010
Figure 11. The changes in assets during the investment period according to the optimal investment portfolio and the 30 random portfolios.
Figure 11. The changes in assets during the investment period according to the optimal investment portfolio and the 30 random portfolios.
Algorithms 16 00314 g011
Table 1. Similarities and differences of the problems between the mentioned literature and ours.
Table 1. Similarities and differences of the problems between the mentioned literature and ours.
Ref.ObjectConnectivityPrincipal Submatrix or NotSubmatrix OrderAccurate or Approximate Solution
[1]The sum of the entries of submatrixDiscreteNUnfixedAccurate
[2]The sum of the entries of submatrixDiscreteNUnfixedApproximate
[3,4,5]The sum of the entries of submatrixConsecutiveNUnfixedAccurate
[6]The determinant of submatrix +DiscreteYFixedApproximate
[7]The sum of the entries of submatrix ++DiscreteYFixedApproximate
[8]The maximum volume submatrixDiscreteYFixedApproximate
OursThe sum of the entries of submatrixDiscreteYFixedAccurate
+ This is equivalent to the largest logarithm of the determinant of submatrix in [6]; ++ The maximum edge-weighted subgraph problem in [7] is equivalent to the problem of maximal sum of the entries of submatrix.
Table 2. The average runtime of extracting 5-order and 10-order MSPS.
Table 2. The average runtime of extracting 5-order and 10-order MSPS.
nk T E n u m T o u r s η (%) k T E n u m T o u r s η (%)
2050.0020 0.0016 22.36 100.0746 0.0544 37.15
3050.0138 0.0094 47.04 1010.5342 6.1387 71.60
4050.0610 0.0384 59.15 10289.4542 153.5836 88.47
5050.1946 0.1170 66.29 103485.9831 1629.6164 113.91
6050.4980 0.3157 57.71 1026,507.8060 11,480.9314 130.89
Table 3. The runtime of extracting k -order MSPS of a 30-order real matrix.
Table 3. The runtime of extracting k -order MSPS of a 30-order real matrix.
kTEnumToursη (%)
20.00150.000962.76
50.01550.009957.02
81.31050.726980.29
1123.270413.210476.15
14102.959864.439759.78
17143.0534104.177437.32
2089.366077.543115.25
2362.898961.86351.67
2664.763564.9295−0.26
2957.227957.12160.19
Table 4. The average runtime of extracting k -order MSPS from 3k-order matrix.
Table 4. The average runtime of extracting k -order MSPS from 3k-order matrix.
nk T E n u m T o u r s η
930.0011150.00077643.76
1550.0007850.0007751.23
2170.0221760.01558742.27
2791.3623910.81053168.09
331181.1776844.2712983.36
39134693.5592424.0693.62
Table 5. Different investment portfolios corresponding to different k values.
Table 5. Different investment portfolios corresponding to different k values.
kPortfolios
31 14 40
41 14 28 41
51 9 14 28 40
61 9 16 28 40 41
71 9 14 28 30 40 41
81 9 14 28 30 33 40 41
91 9 14 28 30 33 40 41 44
101 9 10 28 30 33 38 40 41 44
Table 6. Return and σ of simulated investments for each portfolio after 102 weeks.
Table 6. Return and σ of simulated investments for each portfolio after 102 weeks.
IndicatorOursRnd1Rnd2Rnd3Rnd4Rnd5Rnd6Rnd7Rnd8Rnd9Rnd10
σ 0.02780.03070.02600.01330.01970.03430.01720.02910.02930.02200.0316
Ret1.03321.01971.03000.98620.95841.02401.02601.00241.00291.05220.9509
IndicatorRnd11Rnd12Rnd13Rnd14Rnd15Rnd16Rnd17Rnd18Rnd19Rnd20
σ 0.02430.02290.02890.01890.02510.01490.03120.03020.02370.0135
Ret0.98661.03661.05641.01641.01851.03531.04510.99610.99501.0168
IndicatorRnd21Rnd22Rnd23Rnd24Rnd25Rnd26Rnd27Rnd28Rnd29Rnd30
σ 0.02600.02060.04100.02870.02500.02420.02060.02640.02040.0169
Ret1.02431.00551.01701.02441.01651.03611.03881.03861.01220.9902
Table 7. The difference in return and σ between Rnd k and Ours.
Table 7. The difference in return and σ between Rnd k and Ours.
Diff.Rnd1Rnd2Rnd3Rnd4Rnd5Rnd6Rnd7Rnd8Rnd9Rnd10
σ 0.0028−0.0019−0.0146−0.00810.0064−0.01060.00130.0015−0.00580.0038
Ret−0.0134−0.0031−0.0470−0.0747−0.0092−0.0072−0.0308−0.0302−0.0190−0.0823
Diff.Rnd11Rnd12Rnd13Rnd14Rnd15Rnd16Rnd17Rnd18Rnd19Rnd20
σ −0.0035−0.00490.0011−0.0089−0.0028−0.01290.00340.0023−0.0042−0.0143
Ret−0.04660.00340.0232−0.0167−0.01460.002150.0119−0.0371−0.0382−0.0163
Diff.Rnd21Rnd22Rnd23Rnd24Rnd25Rnd26Rnd27Rnd28Rnd29Rnd30
σ −0.0018−0.00730.01320.0008−0.0028−0.0036−0.0072−0.0015−0.0074−0.0109
Ret−0.0089−0.0277−0.0162−0.0088−0.01670.00290.00570.0055−0.0210−0.0430
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Zhang, Y.; Luo, L.; Li, H. The Extraction of Maximal-Sum Principal Submatrix and Its Applications. Algorithms 2023, 16, 314. https://doi.org/10.3390/a16070314

AMA Style

Zhang Y, Luo L, Li H. The Extraction of Maximal-Sum Principal Submatrix and Its Applications. Algorithms. 2023; 16(7):314. https://doi.org/10.3390/a16070314

Chicago/Turabian Style

Zhang, Yizheng, Liuhong Luo, and Hongjun Li. 2023. "The Extraction of Maximal-Sum Principal Submatrix and Its Applications" Algorithms 16, no. 7: 314. https://doi.org/10.3390/a16070314

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers. See further details here.

Article Metrics

Back to TopTop