Detailed Placement and Global Routing Co-optimization with Complex Constraints

Abstract: Divided into several stages, placement and routing are the most critical and challenging steps in VLSI physical design. To ensure that physical implementation problems remain manageable and converge in a reasonable runtime, placement/routing problems are usually further split into several sub-problems, which may cause conservative margin reservation and mis-correlation. Therefore, it is desirable to design an algorithm that can accurately and efficiently consider placement and routing simultaneously. In this paper, we propose a detailed placement and global routing co-optimization algorithm that considers complex routing constraints to avoid conservative margin reservation and mis-correlation in the placement/routing stages. Firstly, we present a rapid preprocessing technique based on R-trees to improve the initial routing results. After that, a BFS-based approximate optimal addressing algorithm in 3D is designed to find a proper destination for cell movement. We propose an optimal region selection algorithm based on the partial routing solution to jump out of locally optimal solutions. Further, a fast partial net rip-up and reroute algorithm is used in the process of cell movement. Finally, we adopt an efficient refinement technique to reduce the routing length further. Compared with the top 3 winners on the 2020 ICCAD CAD contest benchmarks, the experimental results show that our algorithm achieves the best routing length reduction for all cases with a shorter runtime. On average, our algorithm improves on the first, second, and third place by 0.7%, 1.5%, and 1.7%, respectively. In addition, we still obtain the best results after relaxing the maximum cell movement constraint, which further illustrates the effectiveness of our algorithm.


Introduction
In recent years, with the rapid development of integrated circuit manufacturing processes, the geometric dimensions of integrated circuits have been continuously reduced, and the integration level has continued to increase. Coupled with the limitations of storage space and packaging processes, the complexity of very large scale integration (VLSI) design has increased dramatically. Physical design is one of the key aspects of VLSI design and is the core of electronic design automation (EDA) tools. It mainly includes the following stages: partitioning, floorplanning, placement, and routing [1].
Placement and routing are the most critical and challenging steps in VLSI physical design. They are typical large-scale NP-hard problems that significantly impact the performance indicators of integrated circuits. To ensure that physical implementation problems remain manageable and converge in a reasonable runtime, placement/routing problems are usually split into several sub-problems: global placement, legalization, detailed placement, global routing, and detailed routing. The global placement stage finds a location for each cell that optimizes some performance metric (for example, the total wirelength) while ignoring some cell overlaps. The legalization stage eliminates all overlaps while preserving the global placement result as much as possible. The detailed placement stage further optimizes the result of legalization by moving cells. In the global routing stage, all nets are routed on a coarse grid map, and the approximate routing of all nets is determined; that is, a routing range is allocated for each net. Guided by the global routing result, the detailed routing stage determines the specific routing of each net while satisfying all design rules.

Previous Works
Detailed placement is a discrete optimization problem which is also crucial to the quality of the placement solution. By legally relocating the movable cells, detailed placement can improve the solution while satisfying some design constraints, such as routing congestion or placement density [2]. One of the most commonly used methods for detailed placement is the sliding window technique. The branch and bound placer [3] reorders adjacent cell groups in a row by the sliding window technique, where the cells are optimally reordered in each window. Another important method is cell matching. NTUplace3 [4] proposes to find a set of exchangeable/independent cells in a given window and formulates a bipartite matching problem by assigning cells to available slots in the window. The cell moving/swapping technique is also a beneficial and effective method for detailed placement. FastPlace-DP [5] moves/swaps cells to their optimal locations without overlapping or changing other cells. After finding the optimal region, the cell is exchanged with other cells or white space in the optimal region. The overlap penalty is estimated by the distance needed to shift the surrounding cells to a legalized position. The difference between the total wirelength before and after the exchange, together with the penalty charged on the increased overlap, is the measure used to select the cell or space in the optimal region. In addition, some detailed placers try to improve routability while reducing the wirelength. For example, RippleDP [6] uses a congestion-aware FastPlace-DP to avoid swapping/moving cells into regions of possible routing congestion. After moving cells to the optimal HPWL regions, the locations can be locally improved by inter-row moves, cell reordering, and compaction. However, these methods seldom consider routability, and there may still be considerable congestion in the subsequent global routing stage.
Traditionally, global routers route a path for each net on a fixed placement result produced by a detailed placer. There are two strategies for performing global routing on the 3-dimensional structure. One is to solve the routing problem on the 3D routing grids directly. FGR [7], which is based on the discrete Lagrange multipliers technique, can obtain a good 3D routing result at the cost of an extremely long runtime. GRIP [8] applies integer programming to minimize wirelength and via cost simultaneously without a layer assignment phase. GRIP also consumes too much runtime to be practical. Recently, CUGR [9] makes great use of the 3D structure of a grid graph with a probability-based cost scheme, 3D pattern routing, and multi-level 3D maze routing. The other approach is to transform the 3D routing grids into 2D grids. FLUTE [10] is conventionally employed to decompose each multi-pin net into a set of two-pin nets to generate an initial solution. After performing 2D global routing, the 2D solutions are extended to 3D solutions with layer assignment techniques. Most global routers adopt this two-step routing strategy and achieve high-performance routing results, such as NCTU-GR 2.0 [11], FastRoute 4.0 [12], NTU-GR [13] and NTHU-Route 2.0 [14]. However, these routers route on a fixed placement result and do not allow cell movement. Thus, global routing information cannot be fed back to the placement to optimize the wirelength further.
However, this divide-and-conquer approach may cause information asymmetry between sub-problems. For example, a placer should systematically guide a router to avoid congestion and achieve high routability by considering cell density or pin density. But the cell density or pin density of the placement stage may not accurately depict the actual track density of the routing congestion problem. To bridge the gap between placement and routing, the previous works IPR [15], GRPlacer [16], CRISP [17] and FastRoute [18] all combine a fast global router with their placer to offer accurate wirelength estimation. SRP [19] considers routing and placement simultaneously, starting from a given placement and global routing result, to relocate cells that obstruct routability. The work [20] proposes an ILP-based cell movement method that moves cells and routes nets at the same time after global routing. It chooses the median point of all the cells in the connected nets as the candidate location, and constructs an integer linear programming (ILP) model according to the possible routings. In the model, cells that do not belong to the same net are allowed to move at the same time. By dividing the region, it reduces the size of the ILP model and benefits from parallel processing of the independent areas. The wirelength can be improved significantly, even when only 2% of the cells are moved. However, the proposed algorithm has two major drawbacks: (1) according to their experimental results, the runtime of the ILP is sensitive to the quality of the initial solution, so an inferior initial routing solution and placement can lead to a much longer runtime; and (2) the method has poor scalability due to the high complexity of solving the ILP, and it is time-consuming even when only 2% of the cells are moved and the problem is handled region by region.
Furthermore, to alleviate the misalignment between placement and routing, the 2020 ICCAD CAD contest [21], called routing with cell movement, addressed how placement and global routing could cooperate to optimize the routing length further. Cell movement is allowed during the global routing process instead of routing a path for each net on a fixed placement result. Namely, within the contest's time limit, the global router can move certain cells from one grid to another if all the given routing constraints can still be satisfied while the wirelength is further reduced. This makes the problem more complicated, and solving it efficiently is a huge challenge. The work [22] proposes an incremental 3D global routing engine considering cell movement and complex routing constraints to relocate cells and reroute nets. Firstly, Ref. [22] uses a congestion-aware 3D global router to reconnect all the pins of each net with minimized wires and vias. Then, a wirelength-driven movement evaluation method is proposed to find the desired locations for movable cells. Finally, cell-movement-driven incremental routing moves and routes all candidate positions in parallel and determines the desired routing paths that use the minimum routing resources without any routing violation.

Our Works
In this paper, we propose an effective cell movement method with efficient incremental routing, which co-optimizes detailed placement and global routing simultaneously to get the optimal solution. The main contributions of our work are summarized as follows:
• We propose an improved batch scheduling method which can increase the speed of scheduling nets into disjoint batches by 70× in this contest. Further, by combining FLUTE and maze routing, we propose a fast and effective preprocessing and refinement strategy;
• To find a proper destination for cell movement, a BFS-based approximate optimal addressing algorithm in 3D is designed. Further, we propose an optimal region selection algorithm based on the partial routing solution to jump out of locally optimal solutions;
• According to the requirements of our work, four partial rip-up strategies for routing length optimization are presented to make a trade-off between quality and efficiency. Unlike previous works, we present a new routing cost function to consider this problem better. In addition, to improve the rerouting efficiency, we use the A* and the multi-source multi-sink maze routing algorithms jointly to perform partial rerouting operations;
• Compared with the top 3 winners on the 2020 ICCAD CAD contest benchmarks [21], experimental results show that our algorithm achieves the best routing length reduction for all cases with a shorter runtime. On average, our algorithm improves on the first, second, and third place by 0.7%, 1.5%, and 1.7%, respectively. In addition, we still get the best results after relaxing the maximum cell movement constraint, which further illustrates the effectiveness of our algorithm.
The remainder of this paper is organized as follows. Section 2 describes the problem statement and our algorithm flow. Section 3 gives the preprocessing scheme of the initial routing result. Section 4 introduces our partial rip-up, destination selection and partial reroute algorithm. Section 5 presents our refinement approach. Section 6 shows the experimental results. Finally, conclusions are made in Section 7.

Problem Description
In the detailed placement stage, the placement result is usually improved by moving or swapping cells while maintaining the legality between cells. In this paper, we consider the cell movement problem with a given placed and routed design, which was presented in the ICCAD'20 CAD Contest [21]. In this problem, routing resources, including pins and nets, are abstracted as a 3D grid graph of gGrids (global grids), and both cell movement and 3D routing operate on the gGrids. The number of rows $N_r$ and columns $N_c$ of the gGrids is given and is the same for all routing layers. The number of routing layers is given as $N_l$, and a via (vertical interconnect access) is simply modeled as z-direction routing.
The capacity c(u) is defined as the maximum number of routing tracks that can cross the gGrid u. Each layer has a given default capacity value, and the capacity of certain gGrids is increased or decreased relative to this default. Traditionally, the demand d(u) is defined as the actual number of routing tracks crossing the gGrid u. In this problem, the demand d(u) of a gGrid u is the sum of four parts: the routing segment demand, the blockage demand, the extra demand in the same gGrid, and the extra demand in adjacent horizontal gGrid(s). (1) The routing demand is the number of nets that have a routing segment in this gGrid. Note that a net contributes exactly one unit of routing demand to a gGrid, no matter how many of its routing segments cross that gGrid; (2) The blockage demand of a cell is added to the gGrid where the cell is located, and changes as the location of the cell changes; (3) When a certain pair of cells is placed in the same gGrid, this gGrid needs an extra demand; (4) When a certain pair of cells is placed in horizontally adjacent gGrids, both of these gGrids need an extra demand. Congestion happens when the demand d(u) exceeds the capacity c(u) assigned to the gGrid u. The resource r(u) is defined as the difference between the routing capacity and the demand, i.e., r(u) = c(u) − d(u). If r(u) < 0, the resources in gGrid u are insufficient, which is called routing overflow.
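The demand bookkeeping described above can be illustrated with a minimal Python sketch. The function names, the dictionary-based data layout, and the choice to lump both kinds of extra demand into a single `extra` map are assumptions made for illustration only; they are not the contest data format or the authors' implementation.

```python
from collections import defaultdict

def routing_demand(net_cells):
    """Count, for every gGrid, how many distinct nets pass through it.

    net_cells maps a net name to the gGrids (x, y, z) its segments cover;
    a net counts once per gGrid even if several of its segments cross it.
    """
    demand = defaultdict(int)
    for cells in net_cells.values():
        for g in set(cells):          # deduplicate within a net
            demand[g] += 1
    return demand

def resources(capacity, routing, blockage=None, extra=None):
    """Return r(u) = c(u) - d(u) for every gGrid with a known capacity.

    blockage and extra are hypothetical per-gGrid maps; extra stands for
    the same-gGrid and adjacent-gGrid extra demands added together.
    """
    blockage = blockage or {}
    extra = extra or {}
    return {
        g: c - (routing.get(g, 0) + blockage.get(g, 0) + extra.get(g, 0))
        for g, c in capacity.items()
    }

def overflowed(res):
    """gGrids with r(u) < 0, i.e., routing overflow."""
    return sorted(g for g, r in res.items() if r < 0)
```

For instance, two nets sharing a gGrid of capacity 2 that also carries one unit of blockage demand leave r(u) = −1, so that gGrid is reported as overflowed.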
According to the given initial global routing result and the circuit netlist N, the movable cells can be moved from one gGrid to another, and the broken routing paths of the connected nets are then re-connected incrementally so that all the given routing constraints are satisfied while the total routing length is minimized. The routing length is calculated as the number of gGrids that all nets span (a via is counted in the same way as routing in the other directions). The given routing constraints of the problem that should be satisfied are listed as follows.
• Maximum cell movement constraint C3: In order to maintain the information of the given placement result and avoid generating a completely altered placement, the total number of moved cells is limited to at most 30% of all cells;
• Net-based minimum layer constraint C4: The net $e_j$ may have a minimum layer routing constraint $min_{l,j}$. Pins whose z-coordinate is smaller than the minimum layer need to be connected to the minimum layer through vias, and the H/V-direction routing of this net may only be on or above the given minimum layer;
• Layer routing direction constraint C5: The routing direction is horizontal on the first layer M1, and it is different on any two adjacent layers. In other words, H/V-direction routing must be on the odd/even layers, respectively.
Figure 1 shows the overall flow of the proposed approach, which consists of three major stages: R-tree-based fast preprocessing, incremental rerouting with cell movement, and routing length driven refinement. In the preprocessing stage, we first present improved scheduling for parallel routing based on R-trees. After that, a greedy selection strategy is used to accept solutions that reduce the routing length. During the incremental rerouting with cell movement stage, four partial rip-up strategies are proposed to make a trade-off between quality and efficiency while removing the cell. According to the different partial rip-up strategies, a BFS-based approximate optimal addressing algorithm in 3D and an optimal region selection by partial routing solution are proposed to find the candidate destinations of the removed cell. A partial rerouting algorithm that combines A* with a multi-source, multi-sink maze algorithm is proposed to find the optimal destination of cell movement in parallel. Finally, an efficient refinement is adopted to reduce the routing length further.
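As a small illustration of the routing length definition and of constraints C4 and C5 stated in this section, the following Python sketch checks individual segments and counts spanned gGrids. It assumes that segments are axis-aligned pairs of (x, y, z) endpoints, that "horizontal" means a change in x, and that layers are numbered from 1; these conventions and the function names are hypothetical.

```python
def segment_cells(p, q):
    """gGrids covered by an axis-aligned segment from p to q (inclusive)."""
    if sum(a != b for a, b in zip(p, q)) > 1:
        raise ValueError("segment must be axis-aligned")
    (x1, y1, z1), (x2, y2, z2) = p, q
    return {
        (x, y, z)
        for x in range(min(x1, x2), max(x1, x2) + 1)
        for y in range(min(y1, y2), max(y1, y2) + 1)
        for z in range(min(z1, z2), max(z1, z2) + 1)
    }

def routing_length(segments):
    """Number of distinct gGrids spanned by the segments of one net."""
    cells = set()
    for p, q in segments:
        cells |= segment_cells(p, q)
    return len(cells)

def is_legal_segment(p, q, min_layer=1):
    """Check C4 (minimum layer) and C5 (layer direction) for one segment."""
    if p == q:
        return True
    (x1, y1, z1), (x2, y2, z2) = p, q
    if z1 != z2:                      # via: only z may change
        return x1 == x2 and y1 == y2
    if z1 < min_layer:                # C4: no H/V routing below min layer
        return False
    if y1 == y2:                      # horizontal wire: odd layers only
        return z1 % 2 == 1
    if x1 == x2:                      # vertical wire: even layers only
        return z1 % 2 == 0
    return False
```

With this convention, a net routed as a horizontal wire on M1, a via to M2, and a vertical wire on M2 is legal, while the same horizontal wire would violate C4 if the net had a minimum layer of 2.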

RTree-Based Fast Preprocessing
In the global routing stage, complex net structures, unreasonable routing, or infeasible rip-up can result in closed loops and needless nodes. Such redundant routing increases the routing length and makes the region congested. We therefore clean up each net as follows. Firstly, we mark all the net points in the bounding box as unvisited and build the topology of the tree, as shown in Figure 2a. A-F in Figure 2 are the grids through which the net passes. Secondly, depth-first search (DFS) is used to mark the visited nodes (Figure 2b), and the nodes that have no pin are removed during backtracking (Figure 2c,d). After the above operations, the closed loops of nets are broken and the redundant nodes are deleted. More importantly, both the routing length and the congestion are improved.
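The DFS clean-up described above can be sketched in Python as follows. This is a simplified reading of the procedure: it assumes the routed net is given as an adjacency map over grid nodes, skips any edge that would lead back to a visited node (which breaks loops), and drops a branch during backtracking when no pin lies on it. The function name and data layout are hypothetical.

```python
def prune_net(adj, pins, root):
    """Return the set of edges kept after loop breaking and pruning.

    adj  : dict mapping a node to an iterable of neighbouring nodes
    pins : set of nodes that contain a pin
    root : a pin node from which the DFS starts
    """
    visited = {root}
    kept = set()

    def dfs(u):
        # True if the subtree rooted at u contains at least one pin.
        has_pin = u in pins
        for v in adj[u]:
            if v in visited:          # edge would close a loop: skip it
                continue
            visited.add(v)
            if dfs(v):                # keep the edge only if it leads to a pin
                kept.add(frozenset((u, v)))
                has_pin = True
        return has_pin

    dfs(root)
    return kept
```

The recursion keeps the sketch short; for very large nets an explicit stack would avoid Python's recursion limit. The resulting tree is loop-free and has no pin-less dangling branches, but it is not guaranteed to be the shortest such tree.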
From the bounding boxes of the nets in the given initial routing result, we build R-trees [23], which are later queried to find nets with disjoint bounding boxes. Similar to [24], we schedule all the batches in our work by Algorithm 1. Since conflicts are more likely to occur between large nets, line 1 sorts all nets in decreasing order of bounding-box size. Nets are assigned one after another by joining an existing batch or building a new batch (lines 2-18), thus minimizing the number of batches. R-trees are used to judge the overlap between a net bounding box and a candidate batch. In practice, we found that most of the R-tree queries in the later stage of the original algorithm failed, which wasted a lot of time. Therefore, lines 9-11 add criteria to judge whether enough nets have been added to a batch. Since nets with shorter wirelength have a smaller solution space and nets with a larger number of pins are more difficult to route, line 19 reorders the batch list accordingly. In this way, the total scheduling runtime can be improved by 70× (detailed comparisons are shown in Section 6.2). Figure 3 shows an example of our scheduling, where red and green rectangles represent different batches.

Algorithm 1
Input: Nets;
Output: BatchList;
1: Sort all nets in decreasing size of the bounding boxes;
2: for each net $e_i$ do
3:   for each batch $b_j$ in BatchList do
4:     if batch $b_j$ is full then
5:       continue;
6:     end if
7:     if the bounding box of $e_i$ has no overlap with $b_j$ then
8:       Add $e_i$ into $b_j$;
9:       if nums($b_j$) ≥ $n_b$ or $A_{cur}/A_{total} > t$ then
10:        $b_j$ ← full;
11:      end if
12:      break;
13:    end if
14:  end for
15:  if $e_i$ has not been assigned to any batch then
16:    Build a new batch and add $e_i$;
17:  end if
18: end for
19: Sort the BatchList with shorter wirelength and a larger number of pins;

Unlike the congestion-aware 3D global routing in the work [22], we use a greedy method that mixes FLUTE and maze routing in each batch to optimize the initial solution. Our greedy preprocessing algorithm for the initial global routing result is shown in Algorithm 2. Firstly, line 4 uses a very fast and accurate rectilinear Steiner minimal tree (RSMT) algorithm called fast lookup table estimation (FLUTE) [10]. A net-breaking technique is used for high-degree nets to reduce the net size until the table can be used. In addition, an edge shifting technique is used in line 5 to direct routing demand away from congested regions by moving some tree edges without increasing the wirelength [25]. After that, all Steiner trees are broken into 2-pin nets, which give good results in the 2D layout. Thus, we use L-shaped pattern routing and layer assignment to rapidly get a reasonable 3D routing result (lines 6-8). For multi-layer global routing, Ref. [26] adopted dynamic programming to find a layer assignment such that the via cost is minimized while the given congestion constraints are satisfied. Lines 9-12 accept the result if the solution has no overflow and is shorter than the initial result. Otherwise, we use maze routing [25] to reroute the whole net within the 3D boundary. Maze routing is the most popular and powerful technique in global routing for finding a path while avoiding congestion. Guided by a cost function, maze routing finds a short path connecting two pins that passes through as few congested grids as possible. The cost function will be introduced in Section 4.3.
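The batch scheduling of Algorithm 1 can be sketched in Python as below. To stay self-contained, the sketch replaces the R-tree query with a linear overlap scan over each batch, interprets $A_{cur}/A_{total}$ as the summed bounding-box area of the batch divided by a caller-supplied total area, and omits the final reordering of line 19; all three choices, as well as the parameter defaults, are assumptions for illustration.

```python
def overlaps(a, b):
    """Boxes are (xl, yl, xh, yh) in inclusive gGrid coordinates."""
    return not (a[2] < b[0] or b[2] < a[0] or a[3] < b[1] or b[3] < a[1])

def area(box):
    return (box[2] - box[0] + 1) * (box[3] - box[1] + 1)

def schedule_batches(boxes, total_area, n_b=1000, t=0.8):
    """Group nets so that bounding boxes inside one batch are disjoint.

    boxes maps a net name to its bounding box. n_b and t play the role of
    the batch-full criteria in lines 9-11 of Algorithm 1.
    """
    # Line 1: large nets first, since they conflict more often.
    order = sorted(boxes, key=lambda n: area(boxes[n]), reverse=True)
    batches = []
    for net in order:
        box = boxes[net]
        placed = False
        for b in batches:
            if b["full"]:
                continue              # skip batches already marked full
            if not any(overlaps(box, other) for other in b["boxes"]):
                b["nets"].append(net)
                b["boxes"].append(box)
                b["area"] += area(box)
                if len(b["nets"]) >= n_b or b["area"] / total_area > t:
                    b["full"] = True
                placed = True
                break
        if not placed:                # no existing batch fits: open a new one
            batches.append(
                {"nets": [net], "boxes": [box], "area": area(box), "full": False}
            )
    return [b["nets"] for b in batches]
```

Marking a batch as full means later nets never query it again, which is the source of the time saving described above; in the paper the overlap test itself is answered by an R-tree rather than by the linear scan used here.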

Incremental Rerouting with Cell Movement
In this section, we introduce our partial rip-up, destination selection and partial rerouting algorithm. The specific process is as follows. Firstly, for each cell we calculate the bounding-box wirelength that can be saved by moving the cell to its optimal region [5], and sort the cells in decreasing order of this saving. For each cell, we partially rip up the connected nets and find the candidate destinations. For each candidate destination, we first update the extra demand and check that the no-overflow constraint C1 is satisfied, and then reroute between the remaining routing paths and the destination gGrid to obtain the reduced routing length. Since at most one destination will be selected, these rerouting processes can be run in parallel.

Partial Net Rip-Up with Cell Removal
For cell movement, the nets which connect to the pins of the relocated cell need to be ripped up before being re-routed. However, dismantling the entire net inevitably brings a lot of unnecessary recalculation, because some parts of the net are not directly connected with the pins of the removed cell or are far away from the congested region. Retaining the parts of the net that have little effect on the re-routing of the relocated cell saves time and computation. Therefore, we propose a novel method that reuses different parts of the routing paths under different conditions, which is suitable for our problem. This method is more comprehensive than the previous works [22] and SRP [19], where [22] keeps the remaining wires in one connected component and [19] does not consider the impact of the Steiner points. For convenience, we introduce these schemes in this section. When we delete the routing path connected to the pin on the cell to be moved, four cases are considered as follows (the detailed illustration is shown in Figure 4). For simplicity, we only show the case of the 2D rip-up. In the 3D case, the vias above the minimum layer are treated as normal paths, and the vias below the minimum layer may be removed as the pin is removed (a via which is used by other pins is still retained). Assuming that there are n nodes in a single net, by recursively traversing the nodes, we can dismantle the unwanted part of the net in O(n) time.
Case R1: In Figure 4a, grid (7, 3) contains two pins. After removing the red removed pin, we do not rip up any paths connecting to the grid, as in Figure 4b. If there is a minimum layer constraint on this net and there is a via in this grid, only the via of the other pin is retained.
Case R2: In Figure 4c, grid (0, 3) only contains a removed pin, and we delete the connected paths from this pin until we reach a grid that contains a pin or a Steiner point, as in Figure 4d. In this case, the remaining paths may be divided into multiple subnets, whose number equals the degree of this pin. We will connect these subnets and the relocated pin after finding the new location.
Case R3: In Figure 4e, grid (0, 3) contains a removed pin whose degree is larger than one. When the remaining paths need to stay connected (which is discussed in Section 4.2.1), we do not rip up any paths connecting to the grid, as in Figure 4f.
Case R4: In Figure 4g, grid (3, 0) contains a removed pin, and we delete the connected paths from this pin until reaching a grid that contains a pin or until the second passed Steiner point, as in Figure 4h. Compared with case R2, this case destroys the local topology, because we believe that the construction of the first passed Steiner point largely depends on the position of the removed pin. For example, if the removed pin is located at the grid (3, 4), the Steiner point in the grid (3, 3) would not guarantee the shortest length of the net.
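To make case R2 concrete, the following Python sketch walks away from the removed pin along each incident branch and deletes edges until it reaches a grid that holds another pin or a Steiner point. It assumes the net is stored as an undirected adjacency map over grid nodes and treats any node of degree three or more as a Steiner point; both the representation and the function name are hypothetical.

```python
def rip_up_r2(adj, removed, pins):
    """Delete paths from the removed pin up to the next pin or Steiner point.

    adj     : dict node -> set of neighbouring nodes (modified in place)
    removed : node holding the pin of the cell being moved
    pins    : set of nodes holding the other pins of the net
    Returns the list of deleted edges in the order they were removed.
    """
    deleted = []
    for start in list(adj[removed]):  # one walk per branch leaving the pin
        prev, cur = removed, start
        while True:
            adj[prev].discard(cur)
            adj[cur].discard(prev)
            deleted.append((prev, cur))
            # After dropping the incoming edge, a plain path node has exactly
            # one neighbour left; a Steiner point has two or more and a dead
            # end has none, so the walk stops there (or at a pin).
            if cur in pins or len(adj[cur]) != 1:
                break
            prev, cur = cur, next(iter(adj[cur]))
    return deleted
```

Each edge is visited at most once, which is consistent with the O(n) bound mentioned above. Case R4 could be obtained from the same walk by continuing past the first Steiner point on each branch, which is not shown here.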

Destination Selection of Cell Movement
In our work, we select one cell to remove each time and find an optimal destination that gives the maximum routing length reduction. Unlike SRP [19], whose purpose is to optimize routability, in this contest we need to reduce the routing length as much as possible without causing routing overflow. To achieve this goal, we propose the following two candidate destination selection schemes.

BFS-Based Approximate Optimal Addressing Algorithm in 3D
To reduce the number of routing operations, which are extremely time-consuming, we need to approximate the routing process as accurately as possible when selecting the destination. Based on some routing constraints (layer direction, minimum layer, via reuse, overflow), we propose a breadth-first search (BFS)-based approximate optimal addressing algorithm in 3D in Algorithm 3. In this algorithm, we divide the routing range into two parts. The part at or above the minimum layer is explored with a 3D search strategy, while the part below the minimum layer is calculated directly, which significantly reduces the number of search calculations.
If the cell is moved beyond the outer border of its current routing paths, the routing length is very unlikely to decrease. The range $[x_l, y_b] \times [x_r, y_t]$ is therefore obtained from the bounding box of all paths in the connected nets $E_i$. For each net $e_j \in E_i$, lines 5-9 first execute different rip-up strategies according to different situations to get the remaining paths. Since multiple subnets are searched together, it is challenging to ensure efficiency while considering the Steiner points. Therefore, if the net $e_j$ has another pin in the same grid $(x_i, y_i, z_{min,j})$ (where $z_{min,j}$ denotes the pin coordinate in the z-direction on the minimum layer in net $e_j$), or the degree of this pin is greater than one, we need to ensure that each remaining path is still connected in this method (see Figure 4b,f). Otherwise, we delete the connected paths from this pin until reaching a grid that contains a pin or a Steiner point. Furthermore, line 10 calculates the routing length $rl_j$ of the removed paths and the removed via $\Delta via_{j,(x_i,y_i)}$, which has no overlap with other vias in this net.
For each net $e_j$, the z-range $[z_b, z_t]$ is obtained from the bounding box of $e_j$ in the z-direction on the minimum layer, where $z_t$ must be larger than $z_b$ so that both layer directions are available (if the congestion is severe, we extend $z_t = z_t + 1$). Lines 12-16 add the remaining paths to the queue q and mark them as visited, and the cost $dis_j$ of each such gGrid p is 0. Lines 17-30 pop the gGrids from the queue one by one and search the adjacent gGrids according to the direction of the layer. If an adjacent gGrid is unvisited, it is marked as visited and its cost is set to the cost of the current gGrid plus 1; it is then added to the queue unless its demand equals its capacity (which means that no path can pass through this gGrid). This operation is repeated within the given search range $[x_l, y_b, z_b] \times [x_r, y_t, z_t]$ until the queue is empty. Finally, line 31 takes the cost $dis_{j, z_{min,j}}$ of the layer where the pin $z_{min,j}$ is located. Then, we add the length of the required via to each destination, and deduct the length of the overlapping part if the via of other pins can be reused. After searching all the nets in $E_i$, the cost dis(x, y, z) represents the total routing length needed if the cell moves to the destination (x, y, z), and rl is the total rip-up routing length. Although lines 24-26 account for congested gGrids that a single net cannot pass through, several routing paths may pass through a nearly overflowed area at the same time, so the actual routing length may be larger than dis(x, y, z). Therefore, line 35 adds grid(x, y, z) to the priority queue C only when dis(x, y, z) is less than rl.

Algorithm 3
The 3D BFS-based Approximate Optimal Addressing Algorithm.

Input: Removed cell i, the connected nets $E_i$, rip-up routing length rl = 0;
Output: Candidate destinations priority queue C;
1: $x_i, y_i$ ← the original location of removed cell i;
2: $x_l, x_r$ ← the left, right border of all paths in nets $E_i$;
3: $y_b, y_t$ ← the bottom, top border of all paths in nets $E_i$;
4: for net $e_j \in E_i$ do
5:   if another pin in grid($x_i, y_i, z_{min,j}$) or degree > 1 then
6:     $ripupSet_j$ ← keep all paths by R1 or R3;
     ...
12:  for p ∈ $ripupSet_j$ do
13:    q.push(p);
14:    visited(p) ← true;
15:    $dis_j$(p) ← 0;
16:  end for
17:  while q ≠ ∅ do
18:    for $grid_{cur}$ ∈ q do
19:      $grid_{cur}$.pop();
20:      for $grid_{adj}$ ∈ direction($grid_{cur}$, $z_{cur}$) do
21:        if $grid_{adj} \in [x_l, y_b, z_b] \times [x_r, y_t, z_t]$ && !visited($grid_{adj}$) then
22:          visited($grid_{adj}$) ← true;
23:          $dis_j$($grid_{adj}$) ← $dis_j$($grid_{cur}$) + 1;
24:          if d($grid_{adj}$) < c($grid_{adj}$) then
25:            q.push($grid_{adj}$);
     ...
35:    Add (x, y, z) to C;
36:  end if
37: end for

An example of this algorithm is shown in Figure 5. In the figure, we want to search for the candidate destination of the removed pin, whose z-coordinate is M1 and whose minimum layer is M5, within the bounding box of the existing routing path. In Figure 5a, the red and green lines represent the routing path on the minimum layer and the via, respectively. In our algorithm, we separate the routing region by the minimum layer to improve efficiency. On the minimum layer, we set the cost of the routing paths in Figure 5a to 0, and use BFS to find the distances while satisfying the layer direction, as in Figure 5b. After that, since the minimum layer constraint for the removed pin is M5, we take M5's distance map and add the length of the required via to each gGrid, as in Figure 5c. In particular, when the via can be reused, only the length of the newly added non-overlapping part needs to be added. In this algorithm, the nets in $E_i$ are independent of each other and can be processed in parallel. In addition, dividing the search range according to the minimum layer removes a large part of the search space, which makes the algorithm more efficient. Unlike the distance formulation in the previous work [22], our direct search method is closer to the real routing process. Even though our method spends more time than [22], a more accurate destination selection may reduce the time for subsequent reroutes.
Here we analyze the complexity of the algorithm. For each net $e_j$, there are $V = (x_r - x_l) \times (y_t - y_b) \times (z_t - z_b)$ gGrids in the search region.
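The core of the BFS search for a single net can be sketched in Python as follows. The sketch assumes odd layers are horizontal (x changes) and even layers are vertical (y changes), takes the search box as inclusive lower and upper corners whose lowest layer is the minimum layer, and represents congestion by a hypothetical `resource` map holding r(u) = c(u) − d(u) with a default of 1. Via reuse and the aggregation over all nets in $E_i$ are left out.

```python
from collections import deque

def bfs_distance_map(sources, lo, hi, resource=None):
    """Multi-source BFS from the remaining paths of one net.

    sources  : iterable of gGrids (x, y, z) on the remaining paths
    lo, hi   : inclusive corners (x, y, z) of the search box
    resource : optional dict gGrid -> remaining resource r(u)
    """
    resource = resource or {}
    dist = {s: 0 for s in sources}
    queue = deque(dist)
    while queue:
        cur = queue.popleft()
        x, y, z = cur
        # Planar moves follow the layer direction; vias move in z.
        planar = [(1, 0, 0), (-1, 0, 0)] if z % 2 == 1 else [(0, 1, 0), (0, -1, 0)]
        for dx, dy, dz in planar + [(0, 0, 1), (0, 0, -1)]:
            nxt = (x + dx, y + dy, z + dz)
            inside = all(l <= c <= h for l, c, h in zip(lo, nxt, hi))
            if not inside or nxt in dist:
                continue
            dist[nxt] = dist[cur] + 1
            if resource.get(nxt, 1) > 0:   # full gGrids are not expanded
                queue.append(nxt)
    return dist

def destination_costs(dist, min_layer, pin_layer):
    """Cost per (x, y): distance on the minimum layer plus the via length."""
    via = max(0, min_layer - pin_layer)
    return {(x, y): d + via for (x, y, z), d in dist.items() if z == min_layer}
```

As in lines 22-25 of Algorithm 3, a gGrid whose demand has reached its capacity still receives a distance but is not pushed to the queue, so no path is extended through it.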

Optimal Region Selection Using Partial Routing Solution
In the previous section, due to the limitation of the search method, we require the remaining paths to be connected. This will cause the new destination of the cell to depend on the previous topological structure, and it is easy to fall into a locally optimal solution. In most cases, the structure of the first Steiner point directly connected to the removed cell is largely related to the cell location. Therefore, we adopt the R4 strategy, which deletes the connected paths from this pin until reaching a grid that contains a pin or the second passed Steiner point in the hope of constructing a better topology according to the new location of the cell. In this case, a net may be divided into multiple disconnected subnets. Therefore, we improve the optimal region technique in the previous work [5] to find the candidate destination of the cell.
In the previous work [5], if only one cell i is allowed to move, the region with the optimal wirelength after placing the cell is defined as the "optimal region" of this cell. This region is determined by the median idea in the work [27]. Figure 6a shows the optimal region obtained by this method. For the movable cell i, we traverse all the connected nets and find their bounding boxes (not including this cell). For each net j, the left, right, lower, and upper boundaries are denoted by $x^l_j$, $x^r_j$, $y^l_j$, and $y^u_j$, respectively. In the figure, there are three nets connecting to cell i. There are 5, 4, and 3 cells (denoted by diamonds) in nets 1, 2, and 3, respectively. The bold dotted boxes are the bounding boxes of the nets excluding cell i. From [27], the optimal region $[x^r_2, y^l_2] \times [x^l_3, y^u_2]$ is given by the medians of the x-series $(x^l_1, x^l_2, x^r_2, x^l_3, x^r_3, x^r_1)$ and the y-series $(y^l_3, y^u_3, y^l_2, y^u_2, y^l_1, y^u_1)$ of the bounding boxes. At every gGrid in this optimal region, the sum of the distances to the bold dotted bounding boxes is the same, and it is smaller than at the other gGrids.
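The median construction can be checked numerically with a short Python sketch. The bounding-box coordinates below are made up so that the sorted x-series and y-series follow the same order as in Figure 6a; the function names are hypothetical and the sketch ignores all routing constraints.

```python
def optimal_region(net_boxes):
    """net_boxes: one (x_l, x_r, y_l, y_u) box per connected net, cell excluded.

    Returns ((x_low, x_high), (y_low, y_high)), the two medians on each axis.
    """
    xs = sorted(v for (xl, xr, _, _) in net_boxes for v in (xl, xr))
    ys = sorted(v for (_, _, yl, yu) in net_boxes for v in (yl, yu))
    k = len(net_boxes)          # 2k values per axis, medians are items k-1 and k
    return (xs[k - 1], xs[k]), (ys[k - 1], ys[k])

def distance_to_boxes(x, y, net_boxes):
    """Sum of Manhattan distances from gGrid (x, y) to every bounding box."""
    total = 0
    for xl, xr, yl, yu in net_boxes:
        total += max(xl - x, 0, x - xr) + max(yl - y, 0, y - yu)
    return total

# Assumed boxes for nets 1, 2, 3 (same ordering of boundaries as in Figure 6a).
boxes = [(0, 10, 8, 10), (1, 3, 4, 6), (5, 8, 0, 2)]
region = optimal_region(boxes)                 # x in [x^r_2, x^l_3], y in [y^l_2, y^u_2]
inside = distance_to_boxes(4, 5, boxes)        # a gGrid inside the region
corner = distance_to_boxes(3, 4, boxes)        # a corner of the region
outside = distance_to_boxes(2, 5, boxes)       # a gGrid just outside the region
```

In this instance the region is x in [3, 5] and y in [4, 6]; the summed distance is 8 at both sampled gGrids of the region and 9 at the sampled gGrid outside it.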
In the previous work, the optimal region was only related to the cell's position, and the minimum estimated wirelength may have a large gap with the actual routing length. In this work, we consider the cell's position as well as the actual routing solution. The routing paths usually carry information about routing constraints, such as layer direction and congestion. For example, in Figure 6b, we consider the routing paths on the basis of Figure 6a. In the figure, the solid lines represent the remaining paths, and the dashed lines represent the paths removed together with cell i. In net 1, the cells at $y^u_1$ are routed downward instead of being connected by a horizontal line because they are affected by the layer direction constraint. In this case, the optimal region is $[x^r_2, y^l_1] \times [x^l_3, y^l_2]$, which is smaller than the region in Figure 6a. In both figures, the best destination for the removed cell is $(x^r_2, y^l_2)$. In more complex situations, however, the original method may miss the correct location. In particular, we prioritize the gGrids in the optimal region where the via can be reused. In general, this improved method takes the routing constraints into account as much as possible and improves the results without increasing the runtime.

Figure 6. The optimal region obtained by (a) the method presented in the work [5] and (b) the improved method in our work.

Partial Rerouting by A* and Maze Routing Algorithms
A complete routing tree is built by rerouting the disconnected subnets together. Before presenting the routing algorithm, we first give our cost function and briefly explain some basic routing operations in our algorithm. In our problem, a via is simplified to a route in the z-direction. The cost function presented in the work [9] is as follows: where wl(u) is the wirelength cost, and the function on the right side forms the congestion cost. d(u)/c(u) and r(u) represent the possibility of overflow and the resource, respectively. α determines the ratio of the congestion term, and the variable β of the logistic function determines the global router's sensitivity to overflow. In this problem, there is already a legal initial routing solution, and the objective of rerouting is to reduce the routing length without causing routing overflow. To make the problem easier to solve, we route each net in multiple iterations until we get a solution without overflow. The cost function in our work is modified as follows: where iter ∈ {1, 2, 3} is the iteration in the routing process and γ is a penalty factor that discourages routing through a gGrid that is about to overflow. θ is a positive integer that controls the available capacity. We remove d(u)/c(u) because no gGrid overflows in this problem (d(u) ≤ c(u) must hold). To reduce the routing length as much as possible, we should not treat gGrids differently as long as they have sufficient resources. We only need to avoid crossing the gGrids whose demand is close to capacity. To avoid unnecessary searches, we initially reroute only inside the bounding box of the net's original routing path. Since the pins are usually on the lower layers, the higher metal layers are usually not used for 3D routing. Therefore, the congestion of the lower layers is usually greater than that of the higher layers.
If a solution without overflow cannot be found, we expand the search range in the z-direction as the iteration increases. We do not expand in the x- and y-directions because, for the same increase in routing length, we prefer the route with less congestion.
Among current global routing techniques, maze routing with multiple sources and multiple sinks is one of the most popular. In this problem, the goal is to connect the removed cell and multiple subnets (in most cases, no more than 3). We use multi-source multi-sink maze routing [25] to generate good routing solutions for the multi-pin nets. The time complexity is O(V log V), where V is the number of gGrid points in the search region. This method considers the existing routing tree instead of restricting the two endpoints of the routing path to the original endpoints of the edge being routed. We treat the removed cell as the source and all the gGrid points on the remaining paths as sinks. As in Dijkstra's algorithm, when a gGrid point is extracted from the priority queue, its cost is the shortest distance from the sources to this gGrid point. Once a gGrid point belonging to a sink is extracted from the priority queue, the new set of sources is formed from the old sources, the shortest path just found, and the subnet that was reached. The search process is performed again until all the gGrid points are connected.
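The repeated source-growing search can be sketched in a few lines of Python. This is a simplified illustration rather than the routine of [25]: it runs on a 2D grid with unit step cost, treats full gGrids as blocked cells, and the name `maze_route` and the toy instance are assumptions made for the example.

```python
import heapq

def maze_route(width, height, blocked, source, subnets):
    """Connect `source` to every subnet; return (added wirelength, connected cells)."""
    tree = {source}                          # current set of sources
    remaining = [set(s) for s in subnets]    # each subnet is a set of sink cells
    added_length = 0
    while remaining:
        dist = {cell: 0 for cell in tree}
        prev = {}
        heap = [(0, cell) for cell in tree]
        heapq.heapify(heap)
        hit = None
        while heap:
            d, cell = heapq.heappop(heap)
            if d > dist[cell]:
                continue                     # stale queue entry
            if any(cell in s for s in remaining):
                hit = cell                   # first extracted sink is a closest one
                break
            x, y = cell
            for nxt in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
                inside = 0 <= nxt[0] < width and 0 <= nxt[1] < height
                if inside and nxt not in blocked and d + 1 < dist.get(nxt, float("inf")):
                    dist[nxt] = d + 1
                    prev[nxt] = cell
                    heapq.heappush(heap, (d + 1, nxt))
        if hit is None:
            raise ValueError("some subnet cannot be reached")
        added_length += dist[hit]
        cell = hit
        while cell in prev:                  # backtrace the new path into the tree
            tree.add(cell)
            cell = prev[cell]
        reached = next(s for s in remaining if hit in s)
        tree |= reached                      # the reached subnet becomes a source
        remaining.remove(reached)
    return added_length, tree

# Assumed toy instance: one removed cell at (0, 0) and two disconnected subnets.
length, connected = maze_route(6, 5, set(), (0, 0),
                               [{(3, 0), (3, 1)}, {(0, 4), (1, 4)}])
```

In the toy instance the first search reaches the subnet at (3, 0) after 3 steps, and the second search, started from the enlarged source set, reaches the other subnet after 4 more steps.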
However, the difference in our work is that we only partially rip up the net. For the candidate destinations obtained in Section 4.2.1, we adopt the R2 rip-up strategy to reroute. The worst case here is to connect the disconnected subnets as in the R3 situation and then connect them to the target gGrid, so the rerouted result will not be worse than the estimated result. For the candidate destinations obtained in Section 4.2.2, we adopt the R4 rip-up strategy to reroute. For the case where a cell connects to a single subnet, we can use the A* algorithm [28] to improve efficiency. The A* algorithm has been applied to global routing [29]. It is a widely used and effective direct search method for finding the shortest path in a static road network, and it is also a practical algorithm for many other search problems. The closer the estimated distance is to the actual value, the faster the search. In our method, we use a priority queue to select the gGrid (x, y, z) with the current lowest cost and then use the following heuristic function to guide the search direction of the algorithm:

$$Cost_{astar} = Cost_{predict} + (Cost_{cur} + Cost_{step}(x, y, z)), \quad (3)$$

where $Cost_{predict}$, $Cost_{cur}$, and $Cost_{step}(x, y, z)$ represent the minimum cost estimate to the target gGrid, the current cost, and the step cost from the current gGrid to the next gGrid, respectively. If $Cost_{predict}$ is smaller than the actual routing length, the optimal solution can still be obtained, but the search range is larger and the efficiency is lower. If $Cost_{predict}$ is equal to the actual routing length, the search efficiency is the highest and the solution is optimal. $Cost_{predict}$ is obtained by a 3D distance estimation. In the x- and y-directions, the distance is estimated by the Manhattan distance between the current gGrid and the target gGrid. In the z-direction, the distance is estimated by the following equation.
The estimate is the minimum routing length that satisfies the layer direction constraint, so it is never greater than the actual routing length. Therefore, the quality of the solution is preserved while the search is guided toward the target point, which is clearly better than the directionless search of Dijkstra's algorithm. A simple illustration of the 2D routing process is shown in Figure 7; the 3D case is similar. In the figure, red points represent the subnet and the removed pin to be connected, yellow rectangles are obstacles where the demand is equal to the capacity, and green points represent the gGrids traversed during the search. In our algorithm, we restrict the search range to the bounding box of the existing paths. Unlike the complete search of Dijkstra's algorithm in Figure 7b, the A* algorithm is directional and avoids a large number of unnecessary searches, as in Figure 7a.
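The difference between the two searches in Figure 7 can be reproduced with a small Python sketch. It is a 2D illustration under simplifying assumptions: unit step cost, the Manhattan distance as $Cost_{predict}$, and blocked cells standing in for full gGrids. The function name `grid_search` and the 7 × 7 instance are hypothetical.

```python
import heapq

def grid_search(width, height, blocked, start, goal, use_heuristic):
    """Return (path length, expanded cells); A* if use_heuristic, else Dijkstra."""
    def predict(cell):                      # Cost_predict: Manhattan distance or 0
        if not use_heuristic:
            return 0
        return abs(cell[0] - goal[0]) + abs(cell[1] - goal[1])

    best = {start: 0}                       # Cost_cur of the best known path
    heap = [(predict(start), 0, start)]
    closed = set()
    while heap:
        _, cur, cell = heapq.heappop(heap)
        if cell in closed:
            continue
        closed.add(cell)
        if cell == goal:
            return cur, len(closed)
        x, y = cell
        for nxt in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
            inside = 0 <= nxt[0] < width and 0 <= nxt[1] < height
            if not inside or nxt in blocked:
                continue
            new_cost = cur + 1              # Cost_cur + Cost_step
            if new_cost < best.get(nxt, float("inf")):
                best[nxt] = new_cost
                heapq.heappush(heap, (new_cost + predict(nxt), new_cost, nxt))
    return None, len(closed)

# Assumed instance: a three-cell wall between the pin and the target gGrid.
wall = {(3, 2), (3, 3), (3, 4)}
astar_len, astar_expanded = grid_search(7, 7, wall, (0, 3), (6, 3), True)
dijkstra_len, dijkstra_expanded = grid_search(7, 7, wall, (0, 3), (6, 3), False)
```

Both searches return the same path length because the Manhattan estimate never exceeds the true distance, while the A* variant expands fewer cells in this instance.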

Routing Length Driven Refinement
When the number of moved cells reaches the prescribed maximum, we stop looking for cells that need to be moved. However, because of the movement sequence, some cells that have already been moved can be optimized again. In addition, because the nets are only partially ripped up and rerouted in the above section, some of the nets may not have the optimal topology. Therefore, in this section, we further optimize the results. In this stage, θ in Equation (2) is set to one so that as much of the capacity as possible is used.
A cell that has already been moved may be moved again in this stage. Therefore, we propose a similar but faster 2D BFS scheme to move the cells in this section. Similar to the process of Algorithm 3, we ignore some routing constraints and perform a breadth-first search in a 2D range. The distance in the z-direction is replaced by the minimum distance between the subnet and the removed pin. If these are on the same layer and not on the same straight line, the distance in the z-direction is 2. After considering the reuse of the vias, the estimated distance for the cell to move to any point in the range is obtained. Since this strategy ignores some routing constraints, it yields slightly more candidate locations than the 3D search. At this stage, there are fewer destinations to which a cell can move while reducing the routing length. We adopt the R4 rip-up method and terminate rerouting as soon as a location that reduces the routing length is found.
After that, we reroute each net to get a better topology, as in Algorithm 4. In the algorithm, line 4 first reroutes with FLUTE, as shown in Algorithm 2, lines 4-8. If the number of pins does not exceed 9, FLUTE usually finds the optimal solution. Otherwise, even if the FLUTE solution $Sol_f$ achieves a smaller length than the initial solution, we still use maze routing to get a solution $Sol_m$. In lines 14-18, when the minimum length of these two solutions is larger than that of the initial solution, we restore the initial routing state. Otherwise, we choose the solution with the smaller routing length. It should be noted that when rerouting is unsuccessful or the solution has routing overflow, the routing length rl is set to INT_MAX.
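The selection step described for lines 14-18 can be written as a short Python sketch. It only mirrors the comparison stated in the text; the function names, the string labels, and the tie-break in favor of the FLUTE solution are assumptions, and `INT_MAX` is taken to be the 32-bit signed maximum.

```python
INT_MAX = 2**31 - 1   # assumed 32-bit signed maximum, as in C++

def routing_length(length, success, has_overflow):
    """A failed or overflowing reroute is treated as infinitely long."""
    return INT_MAX if (not success or has_overflow) else length

def select_solution(rl_init, rl_flute, rl_maze):
    """Return which routing to keep: 'initial', 'flute', or 'maze'."""
    if min(rl_flute, rl_maze) > rl_init:
        return "initial"                     # restore the initial routing state
    # Tie-break toward FLUTE is an assumption; the text does not specify it.
    return "flute" if rl_flute <= rl_maze else "maze"

# Example: FLUTE overflowed, maze routing found a shorter legal solution.
rl_f = routing_length(90, success=True, has_overflow=True)
rl_m = routing_length(95, success=True, has_overflow=False)
choice = select_solution(100, rl_f, rl_m)
```

Mapping invalid solutions to `INT_MAX` lets a single comparison handle both the quality check and the legality check.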

Experimental Results
In this section, we first introduce our experimental setup and benchmarks. Then, we study the parallel technology used in this paper to show its impact on performance. After that, we compare our results with the top 3 winners of the ICCAD'20 CAD contest. Finally, we change the maximum cell movement constraint to demonstrate the performance of our proposed algorithm further.

Experimental Setup and Benchmarks
We implemented our routing with cell movement algorithm in the C++ programming language on a 64-bit CentOS Linux workstation with an Intel(R) Xeon(R) CPU E7-4820@2.00 GHz, 128 GB memory, and 8 threads. All the experiments were based on the benchmark suite of the ICCAD 2020 CAD contest [30]. Table 1 shows the statistics of the released benchmarks, where "#gGrids", "#Layers", "#CellInsts", "#Nets", and "Initial #Routes" represent the number of gGrids, routing layers, cells, nets, and initial routes, respectively. "Initial Length" denotes the total routing length of the initial routes. "Max Move" is the maximum cell movement constraint, which is limited to 30% of all cells in the contest. In these benchmarks, case1 and case2 are too small and are only used as initial examples in the contest, so the subsequent experiments exclude these two cases.

Parallel Technology
In this subsection, we study the parallel technology used in this paper to show its impact on performance. Firstly, we show the comparison results of the simultaneous maze rerouting for all nets with the batch scheduling strategy in Table 2. In the table, "RL-Red.", "B-Times", and "R-Times" denote the routing length reduction, the batch scheduling runtime (seconds), and the routing runtime (seconds), respectively. The difference between the improved batch scheduling and the original method in [24] is shown in Algorithm 1, line 9, where $n_b$ and $t$ are set to 24 and 0.5, respectively. On average, our parallel rerouting achieves a 2.629× faster routing runtime than the serial rerouting, and the improved batch scheduling strategy speeds up the original process by 73×. As the number of nets in each batch increases, the routing length decreases because the ordering of nets is disrupted, which also reduces the routing efficiency.

In Section 4.2.1, we proposed a 3D, BFS-based approximate optimal addressing algorithm to find the candidate destinations for the relocated cell. According to the minimum layer constraint, the space is divided into upper and lower parts in our algorithm. The upper part uses the search strategy, and the lower part is calculated directly. In addition, we assume that the nets are independent of each other, so they can be processed in parallel. In Figure 8, "M1" represents the method of directly searching from the layer where the pin is located, and "M2" represents our algorithm. The figure shows that processing the connected nets in parallel reduces the running time by about half. In addition, the efficiency improvement of our method depends on the proportion of the layers that the net passes through that lie below the minimum layer. This method of dividing the routing range into two parts according to the minimum layer constraint is also applied to our routing algorithm.
In Section 4.2, the routing length reduction of each cell move is more significant in the early iterations, which also means that there are a large number of candidate destinations. Therefore, we select at most the first $n_s$ candidate destinations with lower cost. For example, the 3D BFS-based approximate optimal addressing algorithm produces a priority queue ordered by the estimated routing length reduction. Only gGrids with an estimated reduction greater than 0 are added to this priority queue. If the number of items is greater than $n_s$, only the first $n_s$ items are taken. In the optimal region selection algorithm, if the number of optimal regions is greater than $n_s$, we give priority to locations where the vias can be reused or that have enough r(u). This method is similar to the top-k candidate positions in the work [22]; the difference is that the number of our available candidate destinations may be less than $n_s$. As a trade-off between solution quality and runtime, we set $n_s$ to 8 when the number of gGrids is larger than 40,000 and to 16 when it is smaller. These $n_s$ destinations can be rerouted in parallel, and finally, the destination with the maximum routing length reduction is selected.
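The candidate filtering described above can be sketched in Python as follows. The function names and the toy estimates are hypothetical, and the behavior at exactly 40,000 gGrids, which the text leaves open, is assumed here to fall into the 16-candidate case.

```python
import heapq

def candidate_limit(num_ggrids, threshold=40_000):
    """n_s: 8 for designs with more than `threshold` gGrids, otherwise 16."""
    return 8 if num_ggrids > threshold else 16

def select_candidates(estimated_reduction, n_s):
    """Keep at most n_s gGrids with a positive estimated routing length reduction."""
    positive = [(gain, ggrid) for ggrid, gain in estimated_reduction.items() if gain > 0]
    return [ggrid for gain, ggrid in heapq.nlargest(n_s, positive)]

# Assumed estimates: gGrid (x, y, z) -> estimated routing length reduction.
estimates = {(1, 1, 1): 5, (2, 1, 1): 0, (3, 2, 1): 9, (4, 4, 2): -3, (5, 5, 1): 2}
top_two = select_candidates(estimates, 2)
all_positive = select_candidates(estimates, candidate_limit(50_000))
```

Because gGrids without a positive estimate are filtered out first, the returned list can be shorter than $n_s$, which matches the remark that fewer than $n_s$ destinations may be available.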
In the entire algorithm, the more time-consuming operations mainly include preprocessing, rip-up, destination selection, partial rerouting for routing length estimation, restoration (when the routing length is not reduced) or actual routing (when it is reduced), and refinement. Parallel technology can be used in some of these operations, but certain bottlenecks remain. For example, in preprocessing and refinement, it is possible to divide the area and reroute the parts simultaneously, but the large nets that account for most of the rerouting time are difficult to make independent of each other. In destination selection, we can search for different nets simultaneously. However, the number of nets connected to each cell is usually not very large, and the time is mainly determined by the net with the most search layers in the z-direction. In partial rerouting for routing length estimation, the number of candidate gGrids is not large compared with the number of threads, and it continues to decrease as the number of moved cells increases. Therefore, the time mainly depends on the gGrid with the longest rerouting time. Combining the above-mentioned technologies, we show the impact of our parallel technology on performance in Figure 9. As a result, our proposed algorithm obtains an average speedup of 2.15× by using 8 threads.

Comparison of Results with the Top Three Winners
To demonstrate the performance of our proposed algorithm, we compared it with the top 3 winners of the 2020 ICCAD CAD contest [21]. In this contest, the evaluation score is calculated by summing the routing length reduction of all the nets. The ranking of the contest is based on the sum of the scores, while the runtime is limited to 1 hour for each case. Table 3 shows the comparison results of the total routing length reduction and runtime between our algorithm and the top three winners. In the table, "RL-Red.", "Times", and "Normalized" represent the routing length reduction, the runtime in seconds, and the normalized ratios based on our algorithm, respectively. The best result for each benchmark is marked in bold. As shown in the table, our algorithm achieves the best results in all released benchmarks. On average, our algorithm demonstrates improvements of 0.7%, 1.5%, and 1.7% over the first, second, and third place, respectively, with a comparable runtime.

Results with Relaxed Max Cell Movement Constraint
In this contest, most of the constraints are hard constraints; that is, a legal routing result cannot be produced if they are violated. In practical applications, the maximum cell movement constraint C3 is not necessarily limited to 30%. The 2020 ICCAD contest also reports the routing length reduction when the maximum cell movement limit is changed to 0%, 5%, 10%, 30%, and 100%, respectively. Since the contest does not report runtime for these settings, we compare the routing length reduction in Figure 10, while the runtime of our results stays within the 1 hour limit. The black, green, red, and blue lines in the figure represent our method and the top three winners, respectively. The horizontal axis is the percentage of maximum cell movement, and the vertical axis is the routing length reduction. As can be seen from the figure, the black line representing our method is always at the top among all lines. This illustrates the effectiveness of both our routing algorithm and our cell movement strategy.

Conclusions
To resolve the conservative margin reservation and the mis-correlation problem in the divide-and-conquer place and route approach, we design an effective and efficient algorithm to co-optimize the detailed placement and global routing with complex routing constraints. A fast preprocessing technology based on R-tree is presented to improve the initial routing results. During destination selection of cell movement, we propose a 3D, BFS-based approximate optimal addressing algorithm and an optimal region selection using the partial routing solution to find the required locations. A hybrid A* and multi-source, multi-sink maze rerouting algorithm is proposed to find the final destination of cell movement in parallel. The experimental results show that we can obtain the best results with any maximum cell movement. Furthermore, with more advanced manufacturing processes, the constraints continue to increase, such as voltage area constraints, R/C characteristics in different layers, and the timing-based net weight. Our proposed algorithm can be effectively extended to address these problems.