Efﬁcient Algorithms for the Maximum Sum Problems

: We present efﬁcient sequential and parallel algorithms for the maximum sum (MS) problem, which is to maximize the sum of some shape in the data array. We deal with two MS problems; the maximum subarray (MSA) problem and the maximum convex sum (MCS) problem. In the MSA problem, we ﬁnd a rectangular part within the given data array that maximizes the sum in it. The MCS problem is to ﬁnd a convex shape rather than a rectangular shape that maximizes the sum. Thus, MCS is a generalization of MSA. For the MSA problem, O ( n ) time parallel algorithms are already known on an ( n , n ) 2D array of processors. We improve the communication steps from 2 n − 1 to n , which is optimal. For the MCS problem, we achieve the asymptotic time bound of O ( n ) on an ( n , n ) 2D array of processors. We provide rigorous proofs for the correctness of our parallel algorithm based on Hoare logic and also provide some experimental results of our algorithm that are gathered from the Blue Gene/P super computer. Furthermore, we brieﬂy describe how to compute the actual shape of the maximum convex sum.


Introduction
We face a challenge to process a big amount of data in the age of information explosion and big data [1].As the end of Moore's law comes in sight, however, the extra computing power is unlikely obtained from a single processor computer [2].Parallel computing is obviously a direction to faster computation.However, efficient utilization of the parallel architecture is not an effortless task.New algorithms often need to be designed for specific problems on specific hardware.
In this paper, we investigate the maximum subarray (MSA) problem and the maximum convex sum (MCS) problem, the latter being a generalization of the former.We design efficient parallel algorithms for both problems and an improved sequential algorithm for the MCS problem.
The MSA problem is to find a rectangular subarray in the given two-dimensional (2D) data array that maximizes the sum in it.This problem has wide applications from image processing to data mining.An example application from image processing is to find the most distinct spot, such as brightest or darkest, in the given image.If all pixel values are non-negative, solving the MSA problem will return the trivial solution of the whole array.Thus, we normalize the input image (e.g., by subtracting the mean value), such that the relatively brighter spots will have positive sums, while the relatively darker spots will have negative sums.Then, solving the MSA problem on the normalized image can give us the brightest spot in the image.In data mining, suppose we spread the sales amounts of some product on a two-dimensional array classified by customer ages and annual income.Then, after normalizing the data, the maximum subarray corresponds to the most promising customer range.
We now consider the MCS problem that maximizes the sum in a convex shape.The definition of the word "convex" is not exactly the same as that in geometry.We define the convex shape as the joining of a W-shape and an N-shape, where W stands for widening and N for narrowing.We call this shape the W N-convex shape or W N-shape for short.In W, the top boundary goes up or stays horizontal when scanned from left to right, and the bottom boundary goes down or stays horizontal.Fukuda et al. defines a more general rectilinear shape [3], but for simplicity, we only consider the W N-shape.The paper is mainly devoted to computing the maximum sum, and one section is devoted to the computation of the explicit convex shape that provides the sum.
We give an example to illustrate how solving the MCS problem can provide a much more accurate data analysis compared to solving the MSA problem.We compared the size of the hole in the ozone layer over Antarctica between 2006 and 2011 by solving both the MSA problem and the MCS problem on the normalized input images (Figure 1a,d, respectively) of the Antarctic region.Figure 1b,e is from solving the MSA problem on the input images, while Figure 1c,f is from solving the MCS problem on the same set of input images.Solving the MCS problem clearly provides a much more accurate representation of the ozone hole.The numeric value returned by the MCS is also 22%~24% greater than that by the MSA on both occasions.Intuitively similar gains from the W N-shape compared to the rectangular shape can be foreseen for other types of data analysis.In this paper, we assume that the input data array is an (n, n) square two-dimensional (2D) array.It is a straightforward exercise to extend the contents of this paper to (n, m) rectangular 2D input data arrays.
The typical algorithm for the MSA problem by Bentley [4] takes O(n 3 ) time on a sequential computer.This has been improved to a slightly sub-cubic time bound by Tamaki and Tokuyama [5] and also Takaoka [6].For the MCS problem, an algorithm with O(n 3 ) time is given by Fukuda et al. [3], and an algorithm with a sub-cubic time bound is not known.
Takaoka discussed a parallel implementation to solve the MSA problem on a PRAM [6].Bae and Takaoka implemented a range of parallel algorithms for the MSA problem on an (n, n) 2D mesh array architecture based on the row-wise or column-wise prefix sum [7][8][9].A parallel algorithm for the MSA problem was also implemented on the BSP/CGMarchitecture, which has more local memory and communication capabilities with remote processors [10].
In this paper, we implement algorithms for the MSA and MCS problems based on the column-wise prefix sum on the 2D mesh architecture, as shown by Figure 2.This architecture is also known as a systolic array, where each processing unit has a constant number of registers and is permitted to only communicate with directly-connected neighbours.This seemingly inflexible architecture is well suited to be implemented on ASICs or FPGAs.Our most efficient parallel algorithms complete the computation in n − 1 communication steps for the MSA problem and 4n communication steps for the MCS problem, respectively.In each step, all cells execute a constant number of statements in parallel.Thus, our algorithms are cost optimal with respect to the cubic time sequential algorithms [3,4].
We give a formal proof for the parallel algorithm for computing the maximum W (a part of computing the MCS problem).The proof is based on the space-time invariants defined on the architecture.
In a 2D array architecture, some processors to the right are sitting idle in the early stage of computation waiting for inputs to arrive.This is because the data only flows from left to right and from up to down (Figure 2).If appropriate, we attempt to maximize the throughput by extending data flows to operate in four directions.This technique reduces the number of communication steps by a constant factor.
Algorithms are given by pseudocode.

Parallel Algorithms for the MSA Problem
In this section, we improve parallel algorithms for the MSA problem on a mesh array and achieve the optimal n communication steps.Furthermore, we show how the programming techniques in this section lead to efficient parallel algorithms for the MCS problem in the later sections.

Sequential Algorithm
The computation in Algorithm 1 [8] proceeds with the strip of the array from position k to position i.See Figure 3.The variable column[i][j] is the sum of array elements in the j-th column from position k to position i in array a.The variable sum[i][j], called a prefix-sum, is the sum of the strip from Position 1 to position j.Within this strip, variable j sweeps to compute column Then, the prefix sum of this strip from Position 1 to position j is computed by adding column is the minimum prefix sum of this strip from Position 1 to position j.If the current sum is smaller than min_sum is the maximum sum in this strip so far found from Position 1 to position j.It is computed by taking the maximum of sol expressed by max_sum in the figure.After the computation for this strip is over, the global solution, S, is updated by sol This computation is done for all possible i and k, taking O(n 3 ) time.

Parallel Algorithm 1
Algorithm 2 is a parallel adaptation of Algorithm 1.The following program is executed by a processing unit at the (i, j) grid point, which we refer to as cell(i, j).Each cell(i, j) is aware of its position (i, j).Data flow is from left to right and from top to bottom.The control signals are fired at the left border and propagate right.When the signal arrives at cell(i, j), it accumulates the column sum "column", the sum "sum" and updates the minimum prefix sum "min", etc. Figure 4 illustrates the information available to cell(i, j) at step k.At step 2n − 1, cell(n, n) will have computed the maximum sum.We assume that all corresponding instructions in all cells are executed at the same time, that is they are synchronized.We will later make some comments on asynchronous computation.for all 1 ≤ i, j ≤ n in parallel do 10: end for 19: end for

Parallel Algorithm 2
This algorithm (Algorithm 3) does communication bi-directionally in a horizontal way.For simplicity, we assume n is even.The (n, n) mesh is divided into two halves, left and right.The left half operates in the same way as Algorithm 2. The right half operates in a mirror image, that is control signals go from right to left initiated at the right border.All other data also flow from right to left.At the centre, that is at (i, n/2), cell(i, n/2) performs "centre ", which adds the two values that are the sums of strip regions in the left and right whose heights are equal and, thus, can be added to form a possible solution crossing over the centre.At the end of the k-th iteration, all properties in Algorithm 2 hold on the left half, and the properties in the mirror image hold on the right half.In addition, we have that centre[i] is the value of the maximum subarray that lies above or that is touching the i-th row and crosses over the centre line.
for all 1 ≤ i, j ≤ n in parallel do 10: end for 37: end for The strip cell(i, j) processes is a[i + j − k, ..., i][1, ..., j] in the left half, and that in the right half is a Thus, the cells cell(i, n/2) and cell(i, n/2 + 1) processes the strips of the same height in the left half and the right half.Communication steps are measured by the distance from cell(1, 1) to cell(n, n/2) or, equivalently, from cell(1, n) to cell(n, n/2 + 1), which is (3/2)n − 1.By adding the finalization step, we have (3/2)n for the total communication steps.

Parallel Algorithm 3
In Algorithm 4, data flow in four directions.The array is divided into two halves; left and right, as in the previous section.Column sums c and prefix sums s accumulate downwards as before, whereas column sums d and prefix sums t accumulate upwards.See Figure 5.The structure of Algorithm 4 reveals that at the end of the k-th iteration, The height of each subarray is k − j + 1.Since the widths of those two areas are the same, we can have the prefix sum That is, spending k steps, we can achieve twice as much height as that in Algorithm 3.
The solution array sol is calculated as before, but the result is sent into three directions; up, down and right in the left half and up, down and left in the right half.We have the property that sol For simplicity, we deal with the maximum subarray whose height is an even number.For a general case, see the note at the end of this section.
The computation proceeds with n − 2 steps by k and the last step of comparing the results from cell(n/2, n/2) and cell(n/2 + 1, n/2), resulting in n − 1 steps in total.
Note that we described the algorithm for the solution whose height is an even number.This fact comes from the assignment statement "u " where the heights of subarrays whose sums are s and t are equal.To accommodate a height of an odd number, we can use the value of t one step before, whose height is one shorter.To accommodate such odd heights, we need to almost double the size of the program by increasing the number of variables.

Review of Sequential Algorithm for the MCS Problem
We start from describing a sequential algorithm for W based on column sums, given by Fukuda et al. [3].We call the rightmost column of a W-shape the anchor column of W. The array portion of general array b, b[i, ..., k][j, ..., l], is the rectangular portion whose top left corner is (i, j), and the bottom right corner is The computation proceeds with the strip of the array from row k to row i (Figure 6) based on dynamic programming.Within this strip from row k to row i, variable j sweeps to compute 7) or extending a W-shape downward or upward by one (Cases 2 and 3 of Figure 7, respectively).Note that the width of the strip is given by t + 1.That is, we go through from thinner to thicker strips, so that the results for thinner ones are available when needed for Cases 2 and 3. Obviously, Algorithm 5 takes O(n 3 ) time.
Algorithm 5 Sequential algorithm for W.
end for 9: end for 10: for t ← 1 to n − 1 do 11: for j ← 1 to n do 14: 15: 16: 17: 18:  After all W are computed, we compute the N-shape in array N for all anchor columns by a mirror image of the algorithm for W.Then, the finalization of the W N-shape is given by Algorithm 6.
end for 8: end for Note that the W-shape and the N-shape share the same anchor column, meaning we need to subtract one column . This computation is done for all possible i, j and k, taking O(n 3 ) time, resulting in O(n 3 ) time for the maximum convex sum.

Improved Sequential Algorithm
We can observe that Algorithm 5 not only takes O(n 3 ) time, but also requires O(n 3 ) memory.The reason for this memory requirement is that the algorithm stores maximum W and maximum N for all possible anchor columns.
We can improve the memory requirement of Algorithm 5 significantly to O(n 2 ) with a simple modification.Firstly, we observe that there is no reason to keep all maximum W and maximum N for all O(n 3 ) possible anchor columns.Instead, we can iterate over the possible anchor column sizes and compute the O(n 2 ) maximum W and N for the given anchor column size in each iteration, thereby computing the maximum W N for the given anchor column size in each iteration.Thus, in each iteration, we only need O(n 2 ) memory, and there is no need to store maximum W and N values from previous iterations.Note that W and N on shorter columns are available for the t-th iteration.
Algorithm 7 is the pseudocode for achieving the O(n 2 ) memory bound.The pseudocode has been simplified, since much of the details have already been provided in Section 3.

Algorithm 7
Sequential algorithm for MCS.
Compute maximum W for all anchor columns of size t 3: Compute maximum N for all anchor columns of size t

4:
Combine W and N for all anchor columns of size t

5:
Store the current maximum W N 6: end for We note that the reduction in the memory bound is very significant in practical terms.The difference between O(n 2 ) memory and O(n 3 ) memory, if we take image processing as an example, is the difference between being able to process a mega-pixel image entirely in memory and having to resort to paging the results in an incomparably slow execution time.

Parallel Algorithm for MCS
We now give a parallel algorithm that corresponds to the sequential algorithm described in Section 3. Algorithm 8 is executed by the cell at the (i, j) grid point.Each cell(i, j) is aware of its position (i, j).Data flow is from left to right and from top to down.The control signals are fired at the left border and propagate right.When the signal arrives at cell(i, j), it starts to accumulate the column sum column and update top, W and sol.The value of W[i][j] at time k is to hold the maximum W-value based on the anchor column in the j-th column from position i + j − k to position i.This is illustrated in Figure 8.The value of sol[i][j] is to hold the best W-value obtained at cell(i, j) so far.The role of top is to bring down the top value of anchor column to cell(i, j) in time.The role of "old_W" is to provide the value of W one step before, with old_W[i][0] being undefined.Algorithm 8 Parallel algorithm for W. Note that the memory requirement for each cell in Algorithm 8 is constant.When we put W-shapes and N-shapes together, however, we need O(n) space in each cell, as we will explain later on.We assume that all corresponding instructions in all cells are executed in parallel in a synchronized manner.We later make some comments regarding how synchronization is achieved in the actual implementation of the parallel algorithm.
We prove the correctness of Algorithm 8 in the framework of Hoare logic [11] based on a restricted form of that in Owicki and Gries [12].The latter is too general to cover our problem.We keep the minimum extension of Hoare logic to our mesh architecture.
The meaning of Hoare's triple {P}S{Q} is that if P is true before program (segment) S, and if S halts, then Q is true after S stops.The typical loop invariant appears as that for a while-loop; "while B do S".Here, S is a program, and B is a Boolean condition.If we can prove {P ∧ B}S{P}, we can conclude {P} while B does S{P∧ ∼B}, where ∼B is the negation of B. P is called the loop invariant, because P holds whenever the computer comes back to this point to evaluate the condition B. This is a time-wise invariant as the computer comes back to this point time-wise.We establish invariants in each cell.They are regarded as time-space invariants because the same conditions hold for all cells as computation proceeds.Those invariants have space indices i and j and time index k.Thus, our logical framework is a specialization of Owicki and Gries to indexed assertions.
The main assertions are given in the following.scope[a, ..., b][c, ..., d] is the rectangular array portion from the top-left corner (a, c) to the bottom-right corner (b, d) where a candidate for the solution can be found.
At the end of the k-th iteration, the following holds in cell(i, j): For 1 ≤ i ≤ n and 0 ≤ j ≤ k: which is the sum of the j-th column of array a from position i + j − k to position i The above are combined to P, where: Here, Q states that variables in each cell keep the initial values.In the following descriptions, we omit the second portion Q of the above logical formula.
We can prove that P 0 (k) to P 5 (k) are all true for k = 0 by checking the initialization.For each P 0 to P 5 , we omit indices i and j.Using the time index k, we prove {P(k − 1)}cell(i, j){P 0 (k)} to {P(k − 1)}cell(i, j){P 5 (k)}.We use the following rules of Hoare logic.Let x 1 ← y 1 to x n ← y n be assignment statements in cell(1) to cell(n) in general.There can be several in each cell.We use one for simplicity.The meaning of y i /x i is that the occurrence of variable x i in Q is replaced by y i .Parallel execution of cell (1) Parallel assignment rule:

||cell(n)]{Q}
Other programming constructs such as composition (semi-colon), if-then-else statement, etc., in sequential Hoare logic can be extended to the parallel versions.Those definitions are omitted, but the following rule for the if-then and for-loop for the sequential control structure, which controls a parallel program S from outside, is needed for our verification purpose.

Rule for if-then statement:
{P ∧ B}S{Q}, P ∧ ¬B ⇒ Q {P} if B then S{Q} Each cell(i) has a few local variables and assignment statements.For an arbitrary array x, we regard x[i][j] as a local variable for cell(i, j).A variable from the neighbour, x[i − 1][j], for example, is imported from the upper neighbour.Updated variables are fetched in the next cycle.The proof for each {P(k − 1)}cell(i, j){P(k)} for P is given in Appendix.
Theorem 1. Algorithm 8 is correct.The result is obtained at cell(n, n) in 2n − 1 steps.
Proof.From the Hoare logic rule for the for-loop, we have P 5 (2n − 1) at the end.
We used array W[1..n][1..n] to compute the maximum W-shape.The value of W[i][j] at cell(i, j) is ephemeral in the sense that its value changes as computation proceeds.That is, W[i][j] at time k holds the maximum W-shape anchored at column a[i + j − k, ..., i][j], and at the next step, it changes to that value of W anchored at column a[i In order to combine W and N, we must memorize the maximum W and the maximum N for each anchor column.Thus, we need the three-dimensional array store_W, as well as the array store_N.Since in the final stage of W N computation, the same anchor column is added from W and N, we need to subtract the sum of the anchor column.For that purpose, the sum of anchor column a[k, ...
Note that the computation of N goes from right to left.
In Algorithm 9, we provide the complete algorithm that combines W and N, where sol is dropped, and the initialization is omitted.The computation of W and N takes 2n − 1 steps, and W N takes n steps.
.n] can be done in parallel.Algorithm 10 and Figure 9a illustrate how to find the maximum in a 2D mesh in 2n steps, where the maximum can be retrieved at the bottom right corner.If we orchestrate the bidirectional data movement in each of four quarters of the mesh (Figure 9b), so that the maximum of each quarter can meet at the centre for the final selection, it can be done in n + 1 steps.Therefore, Algorithm 9 takes total of 4n communication steps.for all 1 ≤ i, j ≤ n in parallel do Compute W from left to right 4: end for 19: end for 20: for all 1 ≤ i, j ≤ n in parallel do 21: 1: for all 1 ≤ i, j ≤ n in parallel do 2: for all 1 ≤ i, j ≤ n in parallel do 6:

end for
Find the maximum in the same column 8: end for 9: for k ← 1 to n do 10: for all j in parallel do 11:

end for
Find the maximum in the n-th row 13: end for M[n][n] is the maximum

Computation of the Boundary
So far, we computed the maximum sum of the W N-shape.For practical purposes, we sometimes need the shape itself.This problem was solved by Thaher [13] for the sequential algorithm (Algorithm 11).We show how the idea works in our parallel situation.For simplicity, we only show the boundary computation of the W-shape.We prepare the array store_d[i][k][j] to memorize the direction from which the algorithm took one of Case 1, Case 2 or Case 3.This part of enhancement in Algorithm 9 is shown below.
) is on the boundary, and zero otherwise.Let (i, k, j) denote the j-th column from position k to position i, to represent the anchor column of some W-shape.Suppose the anchor column of the maximum W-shape after Algorithm 9 is (i 0 , k 0 , j 0 ).The following sequential algorithm goes backward from (i 0 , k 0 , j 0 ) guided by the data structure store_d[i][k][j].The correctness of the algorithm is obvious.The time O(n) can be shown as follows.
Algorithm 11 Computing the boundary.

end if 13: end while
The time is proportional to the number of changes on indices i, j and k.The index j can be reduced at most n times.For i and k, we observe i − k decreases whenever i or k changes, resulting in O(n) time for i and k.The O(n) time for the boundary can be absorbed in O(n) time of Algorithm 9.It will be easy to organize parallel computation of tracing back over array store_d[i][k][j] by the mesh architecture.
For example, we can convert the array index j to a processor index j and use a one-dimensional mesh architecture as shown below (Algorithm 12).
This version has O(n) time, which provides no gain time-wise, but the O(n 3 ) space requirement on a single processor can be eased.

Algorithm 12
Computing the boundary; j-th cell.

Implementation
We implemented Algorithm 9 on the Blue Gene/P computer under the MPI/Parallel C program environment.There were many practical issues to be considered.We summarize just three issues here as representatives.
Firstly, we cannot assume that each cell knows its own position within the mesh array.Depending on the architecture, additional computation is required for each cell to gather this information.Within the MPI environment, we can let each cell(i, j) know its position (i, j) by the system call "MPI_Cart_coords()".
The second issue is synchronization.We assumed the corresponding statements in all cells are executed in a synchronized manner.If we remove this assumption, that is if the execution proceeds in an asynchronous manner, the algorithm loses its correctness.
In MPI, "MPI_Send()", "MPI_Recv()" and "MPI_Sendrecv()" functions are used to perform synchronous blocking communications [14].As we call these functions in each step, no further mechanisms are necessary to ensure synchronization between cells as the function calls ensure that any given cell cannot progress one or more steps further than the rest.In other words, one cell may reach the end of the given step and tries to move onto the next step before others, but then, the cell must wait for the other cells to reach the same point due to the blocking nature of the MPI communication functions.
The third implementation issue is related to the number of available processors.As the number of processors is limited, for large n, we need to have what is called a coarse grain parallel computer.Suppose, for example, we are given a (1024, 1024) input array and only 16 processors are available.The input array is divided into sixteen (256, 256) sub-arrays, to which the sixteen processors are assigned.Let us call the sub-array for each processor its territory.Each processor simulates one step of Algorithm 8 sequentially.These simulation processes by sixteen processors are done in parallel.At the end of each simulation, the values in the registers on the right and bottom border are sent to the left and top borders of the right neighbour and the lower neighbour, respectively.The simulation of one step takes O((n/p) 2 ) time, and 2n − 1 steps are carried out, meaning the computing time is O(n 3 /p 2 ) at the cost of O(p 2 ) processors.When p = 1, we hit the sequential complexity of O(n 3 ).If p = n, we have the time complexity of Algorithm 8, which is O(n).
While resolving the third implementation issue, we must again face the second issue of synchronization.For each processor to simulate the parallel computation in its own territory in a coarse-grained implementation, we must take extra care to ensure synchronization between simulated cells within each territory.Specifically, we must double the number of variables, that is we prepare variable x 1 for every variable x.Let us associate the space/time index, (i, j, k) with each variable.Let us call x(i, j, k) the current variable and the variable with indices different by one a neighbour variable.For example, sol[i][j] in the right-hand side of the assignment statement is a time-wise neighbour, and that at the left-hand side is a current variable.Furthermore, column[i − 1][j] in the right-hand side is a neighbour variable space-wise and time-wise, and so on.If x is a current variable, change it to x 1 .If it is a variable of a neighbour, keep it as it is.Let us call the modified program P 1 .Now, we define "update" to be the set of assignment statements of the form x ← x 1 .
Example 1.Let P be a one-dimensional mesh program given by Algorithm 13, which shifts array x by one place.Let us suppose x[0] = 0 and x[i] are already given.
In Algorithm 13, x[i] is the current variable, and x[i − 1] is a neighbour variable space-wise and time-wise.An asynchronous computer can make all values zero.For the intended outcome, we perform P 1 , given by Algorithm 14, which includes "synchronize" and "update".
Algorithm 13 Program P.
1: for all i in parallel do 2: x 1 [i] ← x[i − 1] 3: end for 4: synchronize 5: for all i in parallel do 6: x[i] ← x 1 [i] /* update */ 7: end for For our mesh algorithm, Algorithm 8, omitting the initialization part, we make the program of the form that is given by Algorithm 15.
Algorithm 15 Synchronization for coarse grain.update 5: end for Algorithm 9 was executed on Blue Gene/P with up to 1024 cores.For the software side, the programming environment of MPI and the parallel C compiler, mpixlc, were used with Optimization Level 5.The results are shown in Table 1.Arrays of size (n, n) containing random integers were tested.The times for generating uniformly-distributed random numbers, as well as loading the data onto each processor were not included in the time measurement.On Blue Gene/P, the mesh architecture can be configured into a 2D mesh or 3D mesh.The time taken to configure the mesh into a 2D mesh array for our algorithm was included in the time measurement.As we can see from the table, for small n, the configuration time dominates, and increasing the number of processors does not result in a decrease in the execution time.As the size of the input array increases, however, we can see noticeable improvements in the execution times with a larger number of processors.
Note that we were unable to execute the program under certain configurations as shown by 'x' in Table 1.This was due to memory requirements.With large n and small p, each processor must simulate a very large territory.With each cell in the territory requiring O(n) memory to store the maximal values of W and N, the memory requirements can become prohibitive.This highlights the fact that parallelization is required not only to reduce the execution time, but also to handle larger input data.

Lower Bound
Algorithm 8 for W is not very efficient, as cells to the right are idling at the early stage.Suppose we are given an input array with the value a in the top-left cell and b in the bottom-right cell, while all other value are −∞.Obviously the solution is a if a > b, and b otherwise.The values a and b need to meet somewhere.It is easy to see that the earliest possibility is at time n − 1.Our algorithm for W takes 2n − 1 steps, meaning there is still a gap of n steps with the lower bound.This is a sharp contrast with the mesh algorithm for MSA (Algorithm 4) that completes in n − 1 steps.The role of Algorithm 8 is to establish an O(n) time for the MCS problem on the mesh architecture.

Concluding Remarks
We gave an O(n) time parallel algorithm for solving the MSA and MCS problem with n 2 processors and a formal proof for the algorithm to compute the W-shape, a part of the MCS problem.The formal proof for the N-shape can be given in a similar way.The formal proof not only ensures correctness, but also clarifies what is actually going on inside the algorithm.
The formal proof was simplified by assuming synchronization.The asynchronous version with (synchronize, update) in Section 7 would require about twice as much complexity for verification, since we double the number of variables.Once the correctness of the synchronized version is established, that of the asynchronous version will be acceptable without further verification.
It is open whether there is a mesh algorithm for the MCS problem with the O(1) memory requirement in each cell.Mesh algorithms are inherently easy to implement on an FPGA and, thus, can be considered for practical applications.

Figure 1 .
Figure 1.Ozone hole identified by the maximum subarray (MSA) and maximum convex sum (MCS) algorithms (Courtesy of NASA).

Figure 4 .
Figure 4. Illustration for Algorithm 2: The coverage of cell(i, j) at step k.

Initialization 1 :
for all 0 ≤ i, j ≤ n in parallel do 2:

Figure 6 .
Figure 6.Sequential computation with position indices i, j, k.

Figure 8 .
Figure 8. Illustration for Algorithm 8 with position indices (i, j) and time index k.