Space-Time Loop Tiling for Dynamic Programming Codes

Abstract: We present a new space-time loop tiling approach and demonstrate its application to the generation of parallel tiled code of enhanced locality for three dynamic programming algorithms. The technique envisages that, for each loop nest statement, sub-spaces are first generated so that their intersection results in space tiles. Space tiles can be enumerated in lexicographical order or in parallel by means of the wave-front technique. Then, within each space tile, time slices are formed and enumerated in lexicographical order. Target tiles are represented with multiple time slices within each space tile. We explain the basic idea of space-time loop tiling, illustrate it by means of an example, and then present a formal algorithm and prove its correctness. The algorithm is implemented in the publicly available TRACO compiler. Experimental results demonstrate that parallel code generated by means of the presented approach outperforms closely related code generated manually or by means of affine transformations. The main advantage of the generated code is its enhanced locality, which is due to splitting each larger space tile into multiple smaller tiles represented with time slices.


Introduction
In this paper, we deal with the generation of parallel tiled code for dynamic programming codes.
Increasing the performance of dynamic programming code is not a trivial problem because, in general, such code exposes affine non-uniform data dependence patterns typical for nonserial polyadic dynamic programming (NPDP) [1]; these patterns prevent tiling of the innermost loop in a loop nest, which restricts code locality improvement. There are many state-of-the-art manual transformations for DP algorithms [2][3][4][5] as well as dedicated software [6][7][8]. The inherent disadvantage of those solutions is that they were developed for specific dynamic programming tasks; thus, in general, they cannot be applied to an arbitrary DP algorithm. In addition, manual parallel code development can be very costly.
Fortunately, dependence patterns of NPDP code can be represented with the polyhedral model [9]. Thus, well-known polyhedral techniques and corresponding compilers can be applied to automatically tile and parallelize such code, for example, the PLuTo and PPCG academic compilers as well as the commercial R-STREAM and IBM-XL compilers. They extract and apply affine transformations to tile and parallelize loop nests. They have demonstrated considerable success in generating high-performance parallel tiled code, in particular for uniform loop nests.
However, optimizing loop nests that expose affine non-uniform dependences typical for dynamic programming codes by means of affine transformations is not always possible or fruitful: tiling or parallelization may be impossible, only some loops (not all) in a nest can be tiled, or the generated parallel code is not scalable [10][11][12].
In order to increase code locality and generate coarse-grained code, loop tiling [13][14][15][16][17][18][19] can be applied. Tiling for improving locality groups loop statement instances into smaller blocks (tiles), allowing data reuse when a block fits into local memory. In parallel tiled code, tiles are considered indivisible macro statements, each executed by a single thread, which increases code coarseness.
PLuTo [13] is the most popular state-of-the-art, source-to-source polyhedral compiler that automatically generates parallel tiled code.
Unfortunately, PLuTo fails to tile all loops in a given loop nest in the case of DP programs exposing non-uniform dependences [10]. This reduces the locality of target tiled code.
The idea of tiling presented in our previous paper [10] is to transform (correct) original rectangular tiles so that all target tiles are valid under lexicographical order. Tile correction is performed by applying the transitive closure of loop nest dependence graphs. However, the correction technique can generate irregular tiles, and some of them can be too large [20]. Those drawbacks do not allow us to achieve maximal code locality and performance [21].
In this paper, we present a new approach, which enables us to generate parallel tiled code of enhanced locality for a broad class of dynamic programming codes. Experimental results demonstrate that the code generated by means of the presented approach outperforms closely related ones.
The approach presented in this paper can be applied to any affine loop nest, but its effectiveness is empirically confirmed by us only for a class of dynamic programming codes exposing affine non-uniform dependences whose features prevent tiling of one or more of the innermost loops in a loop nest. Papers [11,12] affirm that achieving peak performance for dynamic programming codes requires tiling all loops (including the innermost one) when the full dynamic programming matrix is significantly larger than the cache size. We also empirically discovered that tiling the innermost loop is very important for improving code locality of dynamic programming codes [21].
Thus, currently we limit the proposed approach only to dynamic programming codes exposing dependences preventing tiling the innermost loop. Adapting the approach for resolving other problems is a topic for future research.

Background
In this paper, we deal with the generation of parallel tiled code for three dynamic programming codes implementing the Smith-Waterman (SW) algorithm, the counting algorithm, and Knuth's optimal binary search tree (OBST) algorithm.
The Smith-Waterman algorithm explores all the possible alignments between two sequences and returns the optimal local alignment, guaranteeing maximal sensitivity as a result [22].
It constructs a scoring matrix, H, which keeps track of the degree of similarity between the elements a_i and b_j of the two sequences to be aligned, where 1 <= i <= M and 1 <= j <= N. The next step is filling matrix H. Each element H_{i,j} of H is calculated as follows:

H_{i,j} = max { 0, H_{i-1,j-1} + s(a_i, b_j), max_{k>=1} (H_{i-k,j} - W_k), max_{k>=1} (H_{i,j-k} - W_k) },

where s(a_i, b_j) is a similarity score of elements a_i and b_j that constitute the two sequences, and W_k is the penalty of a gap of length k. The final step of the SW algorithm is trace back, which generates the best local alignment. This step starts at the cell with the highest score in matrix H and continues up to a cell where the score drops to zero.
The SW algorithm is O(MN(M + N)) in time and O(MN) in memory. The extra time factor of (M + N) comes from finding the optimal k by looking back over entire rows and columns.
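As a concrete illustration of the recurrence and its O(MN(M + N)) cost, the following sketch fills H with an assumed similarity score s (match +2, mismatch -1) and a gap penalty table W; all names, bounds, and scores here are illustrative assumptions, not the paper's Listing 1.

```c
#include <string.h>

#define MAXN 64  /* illustrative bound on sequence length */

static int s(char a, char b) { return a == b ? 2 : -1; }  /* assumed scores */

/* Fills the scoring matrix H for sequences a (length M) and b (length N);
 * W[k] is the penalty of a gap of length k. The two k-loops look back over
 * an entire column and row, which is the source of the (M + N) time factor. */
void sw_fill(const char *a, const char *b, int M, int N,
             const int W[MAXN], int H[MAXN][MAXN]) {
    memset(H, 0, sizeof(int) * MAXN * MAXN);
    for (int i = 1; i <= M; ++i)
        for (int j = 1; j <= N; ++j) {
            int best = 0;
            int diag = H[i-1][j-1] + s(a[i-1], b[j-1]);
            if (diag > best) best = diag;
            for (int k = 1; k <= i; ++k)              /* gap ending in column j */
                if (H[i-k][j] - W[k] > best) best = H[i-k][j] - W[k];
            for (int k = 1; k <= j; ++k)              /* gap ending in row i */
                if (H[i][j-k] - W[k] > best) best = H[i][j-k] - W[k];
            H[i][j] = best;
        }
}
```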
The loop nest implementing the SW algorithm is presented in Listing 1.
The counting algorithm computes the exact number of nested structures for a given RNA sequence. It was also introduced by Michael S. Waterman and Temple F. Smith [23]. The authors applied NPDP and tabularized the results of subproblems. The approach populates matrix C with the following recursion: where n is the length of sequence S, l is the minimal number of enclosed positions, and entry C_{i,j} provides the exact number of admissible structures for the subsequence from position i to j. The upper right corner, C_{1,n}, gives the overall number of admissible structures for the sequence.
The code implementing the counting algorithm is presented in Listing 2.
The third NPDP benchmark that we consider in this paper is the optimal binary search tree (OBST) [24], i.e., the case when the tree cannot be modified after it has been constructed.
Knuth's OBST algorithm populates matrix C and is represented with the following recurrence:

C(i, i) = 0, C(i, j) = min_{i < k <= j} ( C(i, k-1) + C(k, j) ) + W(i, j),

where W(i, j) is the sum of the probabilities that each of the items i through j will be accessed. The recurrence can be implemented with the triply nested loops presented in Listing 3.
All three loop nests above belong to the class of affine loops, i.e., for given loop indices, the lower and upper bounds as well as array subscripts and conditionals are affine functions of the surrounding loop indices and possibly of structure parameters (defining loop index bounds), and the loop steps are known constants.
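A minimal sketch of the OBST recurrence follows, assuming integer access weights p[1..n] in place of probabilities; the names c, w, p and the bound NMAX are illustrative assumptions, not the paper's Listing 3.

```c
#include <limits.h>

#define NMAX 32  /* illustrative bound */

/* Fills c with the Knuth OBST recurrence:
 * c(i,i) = 0, c(i,j) = W(i,j) + min over i < k <= j of (c(i,k-1) + c(k,j)),
 * where w[i][j] = W(i,j) is the prefix sum of weights p[i+1..j].
 * The innermost k-loop over split points is the loop that, per the text,
 * affine transformations fail to tile for this class of codes. */
void obst_fill(const int p[NMAX], int n, int c[NMAX][NMAX]) {
    int w[NMAX][NMAX];
    for (int i = 0; i <= n; ++i) { c[i][i] = 0; w[i][i] = 0; }
    for (int i = 0; i <= n; ++i)
        for (int j = i + 1; j <= n; ++j)
            w[i][j] = w[i][j-1] + p[j];
    for (int d = 1; d <= n; ++d)               /* subchain length */
        for (int i = 0; i + d <= n; ++i) {
            int j = i + d, best = INT_MAX;
            for (int k = i + 1; k <= j; ++k) { /* root (split) candidates */
                int cost = c[i][k-1] + c[k][j];
                if (cost < best) best = cost;
            }
            c[i][j] = best + w[i][j];
        }
}
```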
Thus, the affine transformation framework [16,25] can be applied to each of those loop nests in order to generate parallel tiled code manually or automatically. However, for each of them, no affine transformations exist that would allow for tiling the innermost loop. This considerably reduces target code locality.
Perfectly nested loops are those in which all statements are contained in the innermost loop; otherwise, the loops are arbitrarily (imperfectly) nested.
Given a loop nest with q statements, its polyhedral representation includes the following: an iteration space IS i for each statement S i , i = 1, . . . , q, read/write access relations (RA/WA, respectively), and global schedule S corresponding to the original execution order of statement instances in the loop nest.
The loop nest iteration space IS i is the set of statement instances executed by a loop nest for statement S i . An access relation maps an iteration vector I i to one or more memory locations of array elements. Schedule S is represented with a relation, which maps an iteration vector of a statement to a corresponding multidimensional timestamp, i.e., a discrete time when the statement instance has to be executed.
The algorithms presented in this paper use a dependence relation that is a tuple relation of the form { [input list] → [output list] | formula }, where input list and output list are the lists of expressions used to describe input and output tuples, and formula describes the constraints imposed upon input and output lists. It is a Presburger formula built of constraints represented by algebraic expressions and uses logical and existential operators.
In the presented algorithm, standard operations on relations and sets are used, such as intersection (∩), union (∪), and the application of relation R to set S: R(S) = { e' | ∃ e ∈ S : e → e' ∈ R }, i.e., this operation results in the set of all tuples e' that are output tuples of relation R whose corresponding input tuples e belong to set S.
A global (original) schedule is a relation that maps an iteration vector of a statement to a unique multidimensional discrete time. Such a schedule presents the lexicographic order of loop nest statement instances in the global (common) iteration space. In order to extract a global schedule, we apply the PET tool [26].

Overview of Space-Time Tiling
For a given loop nest, optimizing compilers based on affine transformations, such as PLuTo, generate tiles whose dimension is equal to the number of linearly independent solutions to the time partition constraints formed for that nest [16]. If this number is less than the loop nest depth, defined as the number of loops in the nest, the target tiles are unbounded for loops with parametric (unbounded) upper bounds.
For example, if, for a loop nest of depth three whose loop indices are i, j, and k, there exist only two linearly independent solutions to the time partition constraints formed for that nest, we are able to form only two-dimensional tiles of size m × n, where m and n are the sizes of a tile along loop indices i and j, respectively. Let two new outer loops, ii and jj, be responsible for enumerating tiles in the target code. Then, in that code, for given values of ii and jj, the three inner loops i, j, and k enumerate statement instances within a hypercube whose sides are bounded along axes i and j by the values of m and n, but whose size along axis k is not limited. This means that each target tile is unbounded (parametric) when the upper bound of k is a parameter. As a result, for large problem sizes, the data associated with such an unbounded tile cannot all be held in cache, which reduces code locality.
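The situation above can be sketched as follows; all names (N, K, m, n) and the visit counter are illustrative assumptions. Each (ii, jj) tile is an m × n block in the i-j plane but spans the entire k extent, so its memory footprint grows with K.

```c
/* Sketch of a depth-three nest tiled only along i and j (two linearly
 * independent solutions): the k-loop stays untiled, so each tile is
 * unbounded along k for parametric K. Returns the number of visited
 * statement instances, which must equal N*N*K. */
int visit_2d_tiles(int N, int K, int m, int n) {
    int visits = 0;
    for (int ii = 0; ii < N; ii += m)
        for (int jj = 0; jj < N; jj += n)
            for (int i = ii; i < (ii + m < N ? ii + m : N); ++i)
                for (int j = jj; j < (jj + n < N ? jj + n : N); ++j)
                    for (int k = 0; k < K; ++k)  /* untiled: whole extent */
                        ++visits;                 /* S(i, j, k) would run here */
    return visits;
}
```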
In order to increase code locality, we propose to split each unbounded target space tile into smaller ones. Such splitting is based on forming time slices. A time slice is a set of statement instances belonging to one or more time partitions. Statement instances within a time partition have the same multidimensional execution time. Time slices can be formed by means of any valid statement instance schedule.
As a result, we increase the target tile dimension by one. That additional dimension is implemented with an additional loop in the target tiled code, which enumerates time partitions (smaller tiles) within a larger tile. As our experiments demonstrate, for dynamic programming codes, this improves code locality, which results in improved parallel tiled code performance.
The following section presents in detail how the above idea of space-time tiling can be realized.

Space-Time Tiling
In this section, we first illustrate space-time tiling by means of a simple loop nest; then, we demonstrate how it can be adapted to arbitrarily nested loops, discuss how parallel tiled code can be generated, and finally present a formal algorithm.

Tiling a Simple Loop Nest
Let us consider the following loop nest.
Example 1. for(i = 1; i <= n; ++i) for(j = 1; j <= n; ++j)
The relation below represents the dependences available in the loop nest above: are the relation tuples; 0 < i ≤ n ∧ 0 < j < n and 0 < i < n ∧ 2 ≤ j ≤ n are the affine relation constraints; ∧ is the conjunction operator of affine constraints; and ∪ is the union operator of relations. Figure 1a shows the dependences (black arrows) for the loop nest when n = 4. We split the entire iteration space into two sub-spaces, each of width two (in general, any width can be chosen). The parametric set SPACE below represents those sub-spaces: where parameter id_sp is the identifier of a sub-space. For each particular value of parameter id_sp, we obtain a specific set representing the corresponding sub-space. In Figure 1a, the black vertical line divides the entire iteration space into two sub-spaces, SPACE0 and SPACE1, defined by parameters id_sp = 0 and id_sp = 1, respectively.
Let IS be the loop nest iteration space. Then, we form a valid affine schedule for loop nest iterations by applying the "m3 := schedule IS respecting m1 minimizing m2" operator of the iscc calculator [27], which computes a schedule for loop nest iteration space IS that respects all dependences in relation m1 and tries to minimize the dependences in relation m2. As m1 and m2, we take relation R and obtain the following schedule in the tree form [28]: : 0 < i <= n and 0 < j <= n and ((i < n and j >= 2) or (i >= 2 and j < n) or j >= 2 or j < n) }" schedule: where the lines beginning with the word domain represent the iteration domain on which the schedule returned for the considered loop nest is valid, and the lines beginning with the word schedule represent the two different schedules for the loop nest of Example 1, i.e., i and i + j. In order to form a relation implementing wave-fronting [29], we calculate the sum of the two schedules above, i and i + j, which results in the expression 2i + j and allows for statement instance parallelization [29]. We present the target schedule with relation SCHED, which maps each loop nest statement instance to a time partition, where IS is the loop nest iteration space and ∩ restricts the domain of relation SCHED to IS. Iterations defined by vector (i, j)^T and belonging to the same schedule time (2i + j) can be executed in parallel, for example, iterations (1, 3) and (2, 1). Figure 1a presents ten time partitions shown with blue lines (t0, t1, . . . , t9). Using those partitions, we form parametric time slices, each including the same number of time partitions. Supposing that each time slice includes three time partitions (in general, an arbitrary number of time partitions can be included in a single time slice), we obtain a formula for set TIME representing time slices, where id_t is the parameter defining the identifier of a time slice.
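The wave-front schedule 2i + j derived above can be checked with a small sketch (a hypothetical helper, not generated code): iterations sharing one value of t = 2i + j form one time partition and are independent, and partitions are enumerated in increasing order of t.

```c
/* Enumerates the iterations of Example 1 partition by partition
 * (t = 2*i + j), records the visit order in ord[][2], and returns the
 * visit count. For 1 <= i, j <= n, t ranges over 3 .. 3n, i.e., the ten
 * partitions t0..t9 of Figure 1a when n = 4. */
int wavefront_order(int n, int ord[][2]) {
    int cnt = 0;
    for (int t = 3; t <= 3 * n; ++t)      /* serial loop over partitions  */
        for (int i = 1; i <= n; ++i) {    /* iterations inside one        */
            int j = t - 2 * i;            /* partition are independent    */
            if (j >= 1 && j <= n) { ord[cnt][0] = i; ord[cnt][1] = j; ++cnt; }
        }
    return cnt;
}
```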
The value of parameter id_t defines a specific time slice; for example, for n = 4 and id_t = 0, we obtain the corresponding set TIME.
Let us note that the size of the sub-spaces represented with set SPACE is unbounded (parametric) for a parametric loop nest. Using such sub-spaces as tiles can reduce code locality when the size of the data associated with a sub-space is greater than the cache size. To improve code locality, we intersect the sub-spaces represented with set SPACE with the time slices described with set TIME. This splits each sub-space into smaller target tiles, which allows us to improve code locality.
Thus, we calculate a parametric set, TILE, defining target tiles as follows.
For the considered example, this set is the following.
Enumerating the obtained tiles in lexicographical order is valid because all elements of the distance vectors of inter-tile dependences are non-negative. To prove this fact, it is enough to observe that, for the sub-spaces represented with set SPACE, there exist only forward dependence directions (no backward ones). Thus, dependences between sub-spaces spread from those with smaller values of parameter id_sp to those with greater ones, which prevents any cycle among sub-spaces. We draw the same conclusion regarding the time slices represented with parameter id_t: dependences between them spread only in the forward direction. Thus, all elements of the distance vectors of inter-tile dependences are non-negative.
In order to generate target code, we first form relation CODE using set TILE as follows.
That relation maps each iteration [i, j] within iteration space IS to the tuple [id_sp, id_t, i, j], which represents the tile identifier [id_sp, id_t] and the iteration itself [i, j]. Then, we apply the iscc codegen operator to relation CODE in order to obtain the following pseudo-code.
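A hand-written sketch of the kind of loop structure such a relation yields for Example 1 follows (illustrative, not the actual iscc output): the outer loops enumerate tile identifiers (id_sp, id_t) in lexicographic order, and the guard reproduces the SPACE constraint (columns of width two along i) and the TIME constraint (three partitions of 2i + j per slice).

```c
/* Enumerates Example 1 tile by tile; records the visit order in ord[][2]
 * and returns the visit count. Widths (2 space columns, n_t = 3 time
 * partitions per slice) follow the running example. */
int tiled_visit(int n, int ord[][2]) {
    int cnt = 0;
    for (int id_sp = 0; id_sp <= (n - 1) / 2; ++id_sp)
        for (int id_t = 0; id_t <= n; ++id_t)   /* safe over-approximation */
            for (int i = 1; i <= n; ++i)
                for (int j = 1; j <= n; ++j)
                    if (i > 2 * id_sp && i <= 2 * (id_sp + 1) &&  /* SPACE */
                        2 * i + j >= 3 * id_t &&                   /* TIME  */
                        2 * i + j <= 3 * id_t + 2) {
                        ord[cnt][0] = i; ord[cnt][1] = j; ++cnt;
                    }
    return cnt;
}
```

Because the intervals [3*id_t, 3*id_t + 2] partition the integers, every iteration falls into exactly one tile.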

Imperfectly Nested Loops
For imperfectly nested loops, each statement has a local iteration space. In general, the iteration spaces of distinct statements can be of different dimensions. Thus, it is not possible to directly calculate the distance vectors of dependences whose sources and destinations originate from instances of different statements. To cope with this problem, we form a global iteration space common to the instances of all statements. Let us remind the reader that a global schedule presents the original (serial) execution order of each loop nest statement in an iteration space common (global) to all statements. Thus, we apply the global schedule to each named tuple of a dependence relation (see the description of the operator applying a relation to a set in Section 2). For this purpose, we apply the iscc operator apply map m to set s, where m is the relation representing the global schedule and s is a particular tuple of the dependence relation. This results in a new dependence relation whose tuples are unnamed and of the same size; it describes dependences in a global (common) iteration space for all loop nest statement instances.
Distance vectors can be calculated as the difference between the image and the domain of that new relation (representing dependences in the global iteration space) by means of the iscc deltas operator. This is possible because the image and domain of that relation are represented with affine sets whose tuples have the same dimension.
In general, for affine dependences, the obtained distance vectors may include not only integer values; their elements can be represented with affine expressions. Thus, the calculated distance vectors have the following meaning: they represent all the possible distances between the source and destination of each dependence available in a given loop nest in the global iteration space. In general, for affine dependences, the number of such distances can be unlimited (parametric).
Next, we convert those vectors into a single direction vector, which characterizes the directions of all distance vectors. Each element of this vector holds "+" if the corresponding element of every distance vector is non-negative, and "−" if at least one such element is negative. It is worth noting that, in general, the length of a distance vector can be larger than that of the direction vector because a distance vector can include additional constants inserted in the tuples of a global schedule. However, all distance vectors in the global (common) iteration space are of the same length, and the inserted constants are in the same positions for all vectors.
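The summarization just described can be sketched as a small helper (names and the fixed inner dimension 8 are assumptions of this sketch, not part of the paper's procedures):

```c
/* Builds the direction vector: position r holds '+' when element r of
 * every distance vector is non-negative, and '-' when at least one such
 * element is negative. dist holds nvec distance vectors of length len. */
void direction_vector(int nvec, int len, const int dist[][8], char dir[]) {
    for (int r = 0; r < len; ++r) {
        dir[r] = '+';
        for (int v = 0; v < nvec; ++v)
            if (dist[v][r] < 0) { dir[r] = '-'; break; }
    }
    dir[len] = '\0';
}
```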
Throughout the whole paper, we use the following notations: I i is the iteration vector of the iteration space of statement Si, PARAMS i denotes the structure parameters of the loops surrounding statement Si, IS i is the iteration space of statement S i , S i [I i ] is the named tuple representing the iteration vector I i of statement S i , and T i (I i ) is the global (original) schedule time of iteration I i of statement S i .
To form a direction vector, which summarizes all dependence distance vectors available in a given loop nest, we apply Procedure 1 below, which takes into account that the first element of each distance vector is non-negative (negative) if the corresponding loop iterator is incremented (decremented). First, we apply the deltas operator of the iscc calculator to relation R to calculate distance vectors in the common iteration space.
Then, scanning the elements of the calculated distance vectors, we set each position j of DIR_VEC as described above. If DIR_VEC(j) == "+", then we form set SPACE_j, representing sub-spaces of a given width along axis j, for j = l_1, l_2, . . . , l_m. For each of the q loop nest statements, we form a valid schedule respecting all the dependences represented with relation R and allowing for wave-fronting, by any well-known technique, for example [28], and present it as the following relation: where tuple [t_1, t_2, . . . , t_{k_i}] represents the k_i-dimensional schedule for instances of statement S_i.
If a schedule allowing for wave-fronting cannot be formed (a scheduler returns only one schedule for each loop nest statement), then we skip the steps aimed at forming time slices.
In general, relation SCHED_i maps each instance of statement S_i to a discrete multidimensional time. A set of statement instances belonging to the same multidimensional time defines a time partition. Time partitions are represented with the inverse relation, SCHED_i^{-1}. For each of the q loop nest statements, using relation SCHED_i^{-1}, we form set TIME_i, defining time slices, each including a constant number of time partitions: where t_{k_i} is the k_i-th dimension of schedule SCHED_i; parameters t_1, t_2, . . . , t_{k_i - 1}, id_t define the identifier of a time slice; and n_t determines the number of time partitions within a time slice. Constant n_t defines the number of time partitions within the time slice whose identifier is (t_1, t_2, . . . , t_{k_i - 1}, id_t)^T, i.e., for a given i, the inequality n_t * id_t <= t_{k_i} <= n_t * (id_t + 1) − 1 describes the interval over which the value of t_{k_i} ranges. The number of values in that interval equals the number of time partitions within a time slice. The choice of a one-dimensional schedule (t_{k_i}) in relation SCHED_i for defining the number of time partitions within a time slice is justified by practical observation: such a choice is enough to form time slices whose size is satisfactory in practice. Experiments with the loop nests presented in Section 6 confirm that this choice allows defining a large enough number of time partitions within a single time slice, which results in acceptable tiled code performance.
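The interval n_t * id_t <= t_{k_i} <= n_t * (id_t + 1) − 1 above assigns every time partition to exactly one slice; for non-negative partition times, the slice identifier is simply the integer quotient by n_t. A minimal illustrative helper:

```c
/* Maps a (non-negative) time partition t_k to its slice identifier for a
 * slice width of n_t partitions: id_t = t_k / n_t, so that
 * n_t*id_t <= t_k <= n_t*(id_t + 1) - 1 holds. */
int slice_id(int t_k, int n_t) { return t_k / n_t; }
```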
The formula to calculate the sets TILE_i, i = 1, 2, . . . , q, representing the target tiles for each statement S_i is the following: Let us note that the intersection ∩_{j=l_1}^{l_m} SPACE_j results in rectangular tiles for each statement S_i, i = 1, 2, . . . , q. In general, the sizes of those tiles can be unbounded (parametric) when the number of non-negative elements of vector DIR_VEC is less than the number of loop nest iterators (loops). Using such tiles can reduce code locality when the size of the data associated with a rectangular tile is greater than the cache size. In order to improve code locality, for each statement S_i, i = 1, 2, . . . , q, we intersect the rectangular tiles with the time slices represented with set TIME_i. This splits each rectangular tile into smaller target tiles, which allows us to improve code locality.
It is worth noting that the dimension of the tiles obtained by intersecting the rectangular tiles with the time slices represented with set TIME_i is one more than the dimension of the rectangular tiles obtained as the intersection ∩_{j=l_1}^{l_m} SPACE_j, i = 1, 2, . . . , q. This intersection of rectangular tiles with time slices is the basic idea of the approach proposed in this paper.
The identifiers of tiles are represented with the following vector, which can be re-written as TILE_ID = (ID_space, ID_time)^T, where ID_space = (ii_{l_1}, ii_{l_2}, . . . , ii_{l_m})^T and ID_time = (t_1, t_2, . . . , id_t)^T. Identifier ID_space defines a sub-space being the intersection of sub-spaces SPACE_j, j = l_1, l_2, . . . , l_m. Since dependences among sub-spaces along axes j = l_1, l_2, . . . , l_m spread only in the forward direction (because the corresponding elements of the common direction vector are positive), all the corresponding dependence distance vectors regarding vector ID_space have only non-negative elements. Thus, enumerating tiles regarding vector ID_space in lexicographical order is valid.
Within each sub-space represented with identifier ID_space, enumerating the time slices defined with identifier ID_time in lexicographic order is also valid because dependences along time slices spread from a slice with a lexicographically smaller identifier to those with larger ones. Thus, enumerating tiles whose identifiers are represented with vector TILE_ID in lexicographic order is valid.
It is worth noting that the dependence distance vectors regarding ID_time can have negative elements, with the exception of the first one (t_1). For example, for the instances of some statement, the two multidimensional schedules (2, 1, 1)^T and (1, 2, 2)^T can be valid; in that case, a distance vector regarding ID_time is (1, −1, −1)^T.
In order to generate serial code enumerating tiles in lexicographical order, we transform each set TILE_i into a relation CODE_i, i = 1, 2, . . . , q, of the following form: where T_i(I_i) is the multidimensional execution time of iteration I_i in the global iteration space. Relation CODE_i maps each instance S_i(I_i) of statement S_i to a tile identifier and the execution time T_i(I_i) in the global iteration space. Then, we form relation CODE := ∪_{i=1}^{q} CODE_i and generate the tiled code with the iscc codegen operator applied to relation CODE. The generated code enumerates tiles in lexicographical order regarding vector TILE_ID, representing tile identifiers, as well as the statement instances within each tile.

Parallel Code Generation
In order to generate parallel tiled code, we take into account that, in the global iteration space, all dependence distance vectors regarding ID_space have only non-negative elements, as well as the fact that the first element of all dependence distance vectors regarding ID_time is non-negative (see the previous subsection). For such distance vectors, the wave-fronting technique [29] can be applied to generate parallel tiled code. It remaps an iteration space by creating a new loop whose index is a linear combination of two or more loop iterators for which the corresponding elements of all distance vectors are non-negative [18]. This results in code where the outermost loop is serial, while one or more inner loops enumerating tile identifiers can be parallel. To implement wave-fronting, we generate the following relation: where ii_0 is the new iterator formed as the sum of all elements of vector ID_space and the first element of vector ID_time. When the schedule used is one-dimensional, we use parameter id_t instead of t_1. That relation maps each iteration I_i of statement S_i to time partition ii_0, comprising tiles that can be executed in parallel, while the statement instances within each tile are run serially. Target parallel tiled code is generated automatically with the TRACO compiler (traco.sourceforge.net (accessed on 1 September 2021)). First, TRACO forms relation CODE := ∪_{i=1}^{q} CODE_i, which represents tile execution according to the wave-fronting technique. Then, it applies the iscc codegen operator to relation CODE and obtains pseudo-code in the C language. Finally, using the property of wave-fronting that the first loop in that pseudo-code is serial while the second one is parallel (in general, the number of parallel loops is equal to the number of non-negative elements of distance vector DIR_VEC; see the previous subsection), TRACO inserts the OpenMP parallel for directive directly before the second loop of that pseudo-code, making it parallel.
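A hedged sketch of the resulting loop structure follows (names, bounds, and the recording array are illustrative assumptions, not TRACO's actual output): the outermost loop over ii_0 = id_sp + id_t is serial, while tiles on one wave-front are independent, so the inner loop over id_sp is the one TRACO marks parallel.

```c
/* Enumerates tiles wave-front by wave-front: tiles (id_sp, id_t) with the
 * same ii0 = id_sp + id_t are independent. tiles[][2] records the order;
 * the return value is the number of executed tiles. */
int wavefront_tiles(int max_sp, int max_t, int tiles[][2]) {
    int cnt = 0;
    for (int ii0 = 0; ii0 <= max_sp + max_t; ++ii0) {
        /* #pragma omp parallel for  -- the directive TRACO inserts here */
        for (int id_sp = 0; id_sp <= max_sp; ++id_sp) {
            int id_t = ii0 - id_sp;
            if (id_t >= 0 && id_t <= max_t) {
                tiles[cnt][0] = id_sp;   /* in the real code, all iterations */
                tiles[cnt][1] = id_t;    /* of tile (id_sp, id_t) run here,  */
                ++cnt;                   /* serially within the tile         */
            }
        }
    }
    return cnt;
}
```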

Formal Algorithm
Algorithm 1 below is the formal description of the tiling concept presented in the previous subsections. The first step envisages the generation of a polyhedral representation of a loop nest. The second one, for each loop nest statement S_i, i = 1, 2, . . . , q, forms set TILE_i and then converts it into relation CODE_i, which enables the generation of tiled code. To produce set TILE_i, sets SPACE_j, j = l_1, l_2, . . . , l_m, are first formed by means of Procedures 1 and 2. They represent sub-spaces of given widths b_j along axes j = l_1, l_2, . . . , l_m. Within those sub-spaces, dependences spread only in the forward direction because the corresponding elements of a common direction vector are positive.
Then, the algorithm tries to extract a schedule allowing for wave-fronting. This is possible when there exist at least two different schedules for the instances of at least one loop nest statement. If such a schedule cannot be formed, then set TILE_i is calculated as the intersection of all the sets representing sub-spaces. Otherwise, set TIME_i is formed; it describes time slices, each including a constant number of time partitions. It is then used to calculate set TILE_i as the intersection of all the sets representing sub-spaces and the set defining time slices. Finally, for each statement, relation CODE_i is built, and the iscc code generator is applied to the union of those relations in order to generate pseudo-code, which is then converted into target parallel compilable code by means of post-processing.
The target tiles defined with sets TILE_i are de facto time slices inside each space tile calculated as the intersection of all the sub-spaces represented with sets SPACE_j, j = l_1, l_2, . . . , l_m.
It is worth noting that the positions of the "+" elements in a common direction vector point out which rectangular subspaces have to be formed, while the number of "+" elements in this vector defines the dimensionality of the generated space tiles. Generating target tiles as the intersection of rectangular subspaces formed on the basis of a common direction vector results in rectangular tiles. This is an advantage of the presented technique over approaches based on affine transformations, which do not guarantee the generation of rectangular tiles. Table 1 summarizes the features of target tiles and target parallel code provided that the number of positive elements in a common direction vector is equal to m. In general, when sets TIME_i are used for the generation of target tiles, the tile shape is arbitrary. Tile size is defined with the number of statement instances within a time slice. The parallelism degree, measured with the maximal number of parallel loops of target code, is equal to m.

2. For each statement S_i, i = 1, 2, ..., q, perform the following:
(a) Apply Procedures 1 and 2 to form sets SPACE_j, j = l_1, l_2, ..., l_m;
(b) By means of any well-known technique, for example [28], form a valid schedule respecting all the dependences represented with relation R and allowing for wave-fronting. If such a schedule does not exist, then time = false; proceed to step 2(d). Otherwise, time = true; form a schedule represented with the relation SCHED_i;
(c) Using relation SCHED_i, form the set TIME_i defining time slices;
(d) If time == true, then form the set TILE_i as follows:
TILE_i := TIME_i ∩ (∩_{j=l_1}^{l_m} SPACE_j) = [PARAMS_i, ii_{l_1}, ii_{l_2}, ..., ii_{l_m}, t_1, t_2, ..., id_t] -> {[I_i] | constraints_i};
otherwise, form the set TILE_i as the intersection of sets SPACE_j only:
TILE_i := ∩_{j=l_1}^{l_m} SPACE_j;
(e) Using set TILE_i and its constraints_i, form the following relation CODE_i:
CODE_i := [PARAMS] -> {[I_i] -> [ii_0, ii_{l_1}, ii_{l_2}, ..., ii_{l_m}, t_1, t_2, ..., id_t, T_i(I_i)] | ii_0 = ii_{l_1} + ii_{l_2} + ... + ii_{l_m} + t_1(id_t) ∧ constraints_i}.
/* If time == false, then variables t_1, t_2, ..., id_t are absent; for a one-dimensional schedule, id_t is used instead of t_1. */
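To make tile formation concrete, the following sketch enumerates the statement instances of one target tile as the intersection of two rectangular subspaces and a time slice, for a hypothetical 2D iteration space with the schedule t(i, j) = i + j. All names and sizes here are illustrative and are not the sets derived by the algorithm:

```c
#include <assert.h>

/* Toy illustration: a target tile is the intersection of rectangular
 * subspaces SPACE_1, SPACE_2 (bands of i and j) and a time slice TIME
 * (a band of schedule values t = i + j). Hypothetical sizes. */
enum { N = 32, B = 8, G = 4 };  /* domain, space-tile, time-slice widths */

static int in_space(int ii, int jj, int i, int j) {
  /* SPACE_1 fixes a band of i; SPACE_2 fixes a band of j */
  return ii*B <= i && i < (ii+1)*B && jj*B <= j && j < (jj+1)*B;
}

static int in_time(int tt, int i, int j) {
  /* TIME groups schedule values t = i + j into slices of width G */
  return tt*G <= i + j && i + j < (tt+1)*G;
}

/* Count the statement instances of tile TILE(ii, jj, tt),
 * scanned in lexicographic order of (i, j). */
int tile_size(int ii, int jj, int tt) {
  int count = 0;
  for (int i = 0; i < N; i++)
    for (int j = 0; j < N; j++)
      if (in_space(ii, jj, i, j) && in_time(tt, i, j))
        count++;
  return count;
}
```

Enumerating (ii, jj) first and tt innermost visits the smaller tiles (time slices) inside each space tile, which is the property that enhances locality.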

3. Generate tiled code by means of the iscc codegen operator applied to relation CODE := ∪_{i=1}^{q} CODE_i, and postprocess it to obtain compilable parallel code.
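For illustration, the codegen step can be sketched as a short iscc script; the domain D and the tiling map M below are hypothetical stand-ins for the relations CODE_i produced by the algorithm, not the paper's actual relations:

```
# toy 2D loop nest domain
D := [n] -> { S[i,j] : 0 <= i < n and 0 <= j < n };
# map each instance to (space tile, space tile, point) coordinates, 16x16 tiles
M := [n] -> { S[i,j] -> [floor(i/16), floor(j/16), i, j] };
# emit loop code that scans the tiled space in lexicographic order
codegen (M * D);
```

The emitted loops then only need postprocessing (e.g., inserting OpenMP pragmas for the wave-front loop) to become compilable parallel code.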
When sets TIME_i are not used for the generation of target tiles, the shape of tiles is rectangular because the tiles are formed as the intersection of rectangular subspaces located along m axes. Target tiles are hypercubes of dimension m. When the upper bounds of loop iterators are represented with parameters, the size of each such hypercube is not limited if m is less than the loop nest depth. The parallelism degree is equal to m − 1 (this is a property of wave-fronting).

Table 1. Features of target tiles and target code when the number of positive elements of a common direction vector is m.

Tile and Code Features | Sets TIME_i Used | Sets TIME_i Not Used
Shape | Arbitrary; tile surfaces are perpendicular to axes l_1, l_2, ..., l_m, in general, and the tile surfaces along the remaining axes can be arbitrary. | Tiles are rectangular.
Size | Limited to the number of instances inside a time slice within the space tile calculated as SPACE = ∩_{j=l_1}^{l_m} SPACE_j. | Not limited when m is less than the loop nest depth.

Applying Space-Time Tiling to the Examined Loop Nests
The algorithms presented in this paper are implemented in the publicly available source-to-source TRACO compiler (traco.sourceforge.net (accessed on 1 September 2021)).
TRACO takes C code on its input and returns on its output parallel target code in the OpenMP C/C++ standard, generated by means of space-time tiling.
We applied TRACO to the codes presented in Listings 1-3 implementing the Smith-Waterman algorithm, the counting algorithm, and Knuth's OBST algorithm, respectively.
Parallel tiled codes generated by means of space-time tiling are shown in Listings 4-6. In each code, the first two outer loops enumerate space tiles, the third outer loop scans time slices within each space tile, and the remaining loops enumerate statement instances within each time slice. In each code, the second outer loop is parallel and it implements the wave-front parallelization technique.
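The overall shape of such code can be illustrated with a simplified C sketch: outer loops enumerate space tiles, with the second one parallelized by wave-fronting, and inner loops enumerate instances within each tile. For brevity, the recurrence below is a toy Smith-Waterman-style scoring with uniform dependences, the time-slice loop is omitted, and all sizes are illustrative; the actual codes in Listings 4-6 are considerably more involved:

```c
#include <assert.h>
#include <string.h>

enum { N = 48, B = 16, NT = N / B };   /* problem size, tile size, #tiles */

static int max2(int a, int b) { return a > b ? a : b; }
static int min2(int a, int b) { return a < b ? a : b; }

/* Wave-front enumeration of B x B space tiles: all tiles on one
 * anti-diagonal (ii + jj == w) are independent and may run in parallel. */
void dp_tiled(const char *a, const char *b, int H[N+1][N+1]) {
  memset(H, 0, sizeof(int) * (N+1) * (N+1));
  for (int w = 0; w <= 2 * (NT - 1); w++) {
    #pragma omp parallel for
    for (int ii = max2(0, w - NT + 1); ii <= min2(w, NT - 1); ii++) {
      int jj = w - ii;
      for (int i = ii*B + 1; i <= (ii+1)*B; i++)      /* instances     */
        for (int j = jj*B + 1; j <= (jj+1)*B; j++) {  /* within a tile */
          int s = (a[i-1] == b[j-1]) ? 2 : -1;        /* match/mismatch */
          int v = max2(0, H[i-1][j-1] + s);
          H[i][j] = max2(v, max2(H[i-1][j] - 1, H[i][j-1] - 1));
        }
    }
  }
}

/* Untiled reference version of the same recurrence. */
void dp_serial(const char *a, const char *b, int H[N+1][N+1]) {
  memset(H, 0, sizeof(int) * (N+1) * (N+1));
  for (int i = 1; i <= N; i++)
    for (int j = 1; j <= N; j++) {
      int s = (a[i-1] == b[j-1]) ? 2 : -1;
      int v = max2(0, H[i-1][j-1] + s);
      H[i][j] = max2(v, max2(H[i-1][j] - 1, H[i][j-1] - 1));
    }
}

/* Returns 1 iff the tiled and serial versions compute identical tables. */
int tiled_matches_serial(void) {
  static int Ht[N+1][N+1], Hs[N+1][N+1];
  char a[N+1], b[N+1];
  for (int k = 0; k < N; k++) {
    a[k] = (char)('A' + k * 7 % 4);
    b[k] = (char)('A' + k * 5 % 4);
  }
  a[N] = b[N] = '\0';
  dp_tiled(a, b, Ht);
  dp_serial(a, b, Hs);
  return memcmp(Ht, Hs, sizeof(Ht)) == 0;
}
```

The wave-front is legal here because tile (ii, jj) reads values only from tiles (ii-1, jj), (ii, jj-1), and (ii-1, jj-1), all of which lie on earlier anti-diagonals.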
The full listing of carried out calculations as well as the target codes are presented at the website http://traco.sourceforge.net/dp/sw/sw_listing.txt (accessed on 1 September 2021).

Experimental Study
In this section, we present the results of an experimental study with codes implementing the SW, Counting, and Knuth algorithms. Tiled codes were generated by means of PLuTo and TRACO, and they can be found at http://traco.sourceforge.net/dp/sw (accessed on 1 September 2021). All parallel tiled codes were compiled by means of the Intel C++ Compiler (icc) and the GNU C++ Compiler (g++) with the -O3 optimization flag.
In order to carry out experiments, we used three multi-processor machines: a 2 × Intel Xeon E5-2699 v3 (72 threads), an Intel i7-8700 (12 threads), and an AMD Epyc 7543 (64 threads).

For each TRACO and PLuTo code generated, we explored many different tile sizes to find the one resulting in maximal code performance. For TRACO, we empirically found that tile size 16 × 16 × 16 allows us to reach the maximal performance of all examined codes: the first two dimensions define the size of a space tile, while the third one defines the number of time partitions within a time slice. For the PLuTo 2D codes, the best tile size among all sizes examined by us is 16 × 16.

Figure 2 presents the execution times of the TRACO and PLuTo parallel tiled codes implementing the SW algorithm, executed on the three machines for randomly generated sequences of length 1000 to 10,000 and 1000 to 15,000, respectively. As we can observe, the parallel tiled code generated by means of the space-time approach presented in this paper considerably outperforms the one generated with PLuTo.

Figure 3 shows the execution times of the TRACO and PLuTo parallel tiled codes of the Counting algorithm for randomly generated sequences of length 1000 to 10,000. The parallel space-time tiled code also outperforms the one generated with PLuTo on each studied machine.

Figure 4 presents the execution times of the TRACO and PLuTo parallel tiled codes of Knuth's algorithm for randomly generated sequences of length 1000 to 10,000. Space-time tiling outperforms PLuTo tiling significantly because PLuTo is unable to tile the innermost loop in Knuth's code.

Figure 5 depicts the speedups of the TRACO and PLuTo codes achieved on the 2 × Intel Xeon E5-2699 v3 (72 threads) and the AMD Epyc 7543 (64 threads) for sequence length N = 10,000. It is worth noting that, for the Intel Xeon, the speedup of the space-time tiled Knuth's code is greater than 72 (about 114).
For the AMD Epyc, the speedup of the SW code is greater than 64. This means that the space-time target parallel codes expose super-linear speedup on the two modern machines.
To summarize, we may conclude that splitting larger unbounded tiles into smaller ones represented with time slices allows us to increase target parallel tiled code locality, which results in increased performance for the examined dynamic programming codes.

Related Work
In this section, we discuss well-known tiling techniques and compare them with the technique presented in this paper.
Wonnacott et al. introduced serial 3D tiling of "mostly-tileable" loop nests of Nussinov's RNA secondary structure prediction in [11] to overcome the limitations of affine transformations. However, the authors do not present any method for parallelizing tiled codes.
Mullapudi and Bondhugula [12] have explored automatic techniques for tiling codes that lie outside the domain of affine transformation techniques. Three-dimensional iterative tiling for dynamic scheduling is calculated by means of re-orderable reduction chains to eliminate cycles among tiles in the dependence graph for Nussinov's algorithm. Their approach involves dynamic scheduling of tiles rather than the generation of a static code.
Li et al. showed how to use array transposition to enable better caching for Nussinov's algorithm [2] by changing array reads from column order to row order, storing transposed cells in the unused lower triangle of Nussinov's array. However, that approach is restricted to Nussinov's folding only, and it is not clear how other DP algorithms can be optimized.
Sophisticated tile shapes such as diamond and hexagonal tiles are presented in papers [30,31]. The approaches presented in those papers can deal with loop nests exposing affine dependences. However, those tile shapes cannot be applied to programs other than stencils. For example, hexagonal tiling constructs a hexagonal tile shape along the time axis of a stencil code and the first space dimension, and classical tiling along the other space dimensions. Moreover, the idea of using time slices within a larger tile to form smaller target tiles, presented in this paper, was not considered in diamond and hexagonal tiling.
Diamond tiling is enabled by the PLuTo compiler. We tried to generate diamond tiling for the loop nests discussed in the previous section by means of PLuTo. For each of those loop nests, PLuTo failed to generate diamond tiling.
Loop tiling based on a tile correction technique [32] generates tiles of irregular shapes and sizes [20]. This complicates thread load balancing during tile code execution, and it simultaneously does not guarantee that, for a larger tile, all data associated with that tile are held in the cache. This results in decreasing code performance.
Paper [33] introduces the Multi-Way Autogen framework, which first combines mono-parametric tiling of the input iterative DP code with loop-to-recursion conversion in order to obtain a parametrically recursive divide-and-conquer algorithm. Then, it decomposes a loop nest into several pieces in order to expose additional parallelism across loop iterations and across recursive calls. Mono-parametric tiling is based on deriving and applying affine transformations to generate target code, so it fails to tile the innermost loop in the codes examined in our paper, i.e., the target space tiles are unbounded. In order to allow for parallelism and to improve target code locality, the authors suggest decomposing a loop nest into smaller ones; recursive calls are then used so that all inter-tile dependences are respected. Autogen considers only DP algorithms that contain a single assignment statement. Since paper [33] does not contain any full target codes and does not provide any link to Autogen, we are not able to compare the performance of codes generated by means of our technique with that of the codes introduced in paper [33].

Conclusions
The paper presents a novel approach for space-time loop tiling, implemented in the publicly available TRACO compiler. First, for each loop nest statement, subspaces are generated so that their intersection results in tiles, which can be enumerated in lexicographical order or in parallel by means of the wave-front technique. Then, within each tile, time slices are formed, which are enumerated in lexicographical order. The approach was applied to three dynamic programming applications in order to generate parallel tiled code. The results of the experiments carried out with that code demonstrate satisfactory code speedup and scalability. For the same original codes, we applied the state-of-the-art PLuTo compiler, which forms and applies affine transformations in order to generate parallel tiled code, and we presented the results of the code performance comparison.
We experimentally discovered that the proposed approach to generating parallel tiled code has an advantage over affine transformation techniques when those techniques fail to tile the innermost loop in a loop nest, which results in the generation of unbounded tiles and hence poor code locality. Splitting unbounded space tiles into smaller ones represented with time slices within each space tile allows us to increase code locality and preserve enough target code parallelism to be run on modern multi-core computers.
In the future, we will extend space-time tiling with more advanced strategies of subspace generation that are not limited to rectangular shapes. We also plan to study space-time loop tiling for more dynamic programming tasks exposing affine dependence patterns that prevent tiling by means of affine transformations or make such transformations inefficient, and to parallelize such tasks.

Data Availability Statement: Source codes to reproduce all the results described in this paper can be found at the following: http://traco.sourceforge.net/dp/sw (accessed on 1 September 2021).

Conflicts of Interest:
The authors declare no conflict of interest.

Abbreviations
The following abbreviations are used in this manuscript:
NPDP  Non-serial polyadic dynamic programming
SW    Smith-Waterman
ATF   Affine Transformation Framework