The Impact of 3D Stacking and Technology Scaling on the Power and Area of Stereo Matching Processors

Recently, stereo matching processors have been adopted in real-time embedded systems such as intelligent robots and autonomous vehicles, which require minimal hardware resources and low power consumption. Meanwhile, thanks to the through-silicon via (TSV), three-dimensional (3D) stacking technology has emerged as a practical solution for achieving the requirements of high-performance circuits. In this paper, we present the benefits of 3D stacking and process technology scaling for stereo matching processors. We implemented 2-tier 3D-stacked stereo matching processors with GlobalFoundries 130-nm and Nangate 45-nm process design kits and compared them with their two-dimensional (2D) counterparts to identify the comprehensive design benefits. In addition, we examine the findings from various analyses to identify the power benefits of 3D-stacked integrated circuits (ICs) and device technology advancements. From the experiments, we observe that the proposed 3D-stacked ICs, compared to their 2D IC counterparts, achieve reductions of 43% in area, 13% in power, and 14% in wire length. In addition, we present a logic partitioning method suitable for a pipeline-based hardware architecture that minimizes the use of TSVs.


Introduction
Various types of sensors provide three-dimensional (3D) depth information, including stereo cameras, radar, and time-of-flight cameras [1,2]. Recently, to obtain a dense depth map from a pair of stereo images, stereo matching processors have been adopted in real-time embedded systems such as intelligent robots and autonomous vehicles [3][4][5]. Important requirements for stereo matching processors are a high frame rate with minimal hardware resources and low power consumption [6]. In this paper, to achieve these requirements in the design of stereo matching processors, we adopt a through-silicon-via (TSV)-based 3D stacking technology, which has emerged as a practical solution for hardware miniaturization and low-power circuits. A stereo matching processor requires a wide memory bandwidth because of the memory-intensive nature of stereo image processing; stereo matching processors are therefore promising candidates for fully exploiting the benefits of 3D stacking technology.
Several related studies have examined the benefits of 3D stacking by comparing 3D integrated circuits (ICs) with their two-dimensional (2D) IC counterparts. Ouyang et al. described the design of 3D-stacked arithmetic units [7]. Thorolfsson et al. presented a 3D-stacked fast Fourier transform (FFT) processor composed of memory on logic and compared it with its 2D IC counterparts [8]. For a paired comparison, they used the same synthesis output, but they did not compare power consumption.
Table 1. Related studies on 3D-stacked IC design.

Authors | 3D IC Design | Process Technology | Key Features
Ouyang et al. [7] | Arithmetic units with three logic tiers | Massachusetts Institute of Technology (MIT) Lincoln Lab's 180 nm | 11.0%~46.1% reduction in power
Thorolfsson et al. [8] | FFT processor with two logic tiers and one static random access memory (SRAM) tier | MIT Lincoln Lab's 180 nm | 56.9% reduction in wire length, 4.4% reduction in logic power
Neela et al. [9] | Single-precision floating-point unit with two logic tiers | GlobalFoundries 130 nm | 41.5% reduction in footprint, 3% increase in frequency
Kim et al. [10] | 64 processors with one logic tier and one SRAM tier | GlobalFoundries 130 nm | 63.8 GB/s memory bandwidth, power consumption up to 4.0 W
Zhang et al. [11] | System-on-Chip with two logic tiers and three DRAM tiers | GlobalFoundries 130 nm | 12.57 mW power consumption, 8.5 GB/s bandwidth
Saito et al. [12] | Dynamic-reconfigurable memory with one logic tier and one SRAM tier | 90-nm process technology | 63% reduction in area, 43% reduction in latency
Franzon et al. [13] | Digital signal processor with two logic tiers and one SRAM tier | MIT Lincoln Lab's 180 nm | 25% reduction in total power (logic and memory)
Oh et al. [14] | Ternary content-addressable memory with three tiers | MIT Lincoln Lab's 180 nm | 21% reduction in total power

In contrast to the goals of the work mentioned above [7][8][9][10][11][12][13][14], one of the primary goals of this research is to analyze the influence of the reduced wire length that results from vertical stacking on power consumption by comparing 3D ICs with their 2D IC counterparts. In addition, we expand our previous work [15] to investigate the impact of process technology scaling (130 nm down to 45 nm) and TSV scaling (2.2 µm down to 0.8 µm) on the power and area benefits of 3D stacking. We design our 3D-stacked stereo matching processors using GlobalFoundries 130-nm and Nangate 45-nm process design kits (PDKs) and compare them with their 2D IC counterparts in terms of power consumption and wire length.
For a fair comparison of 3D and 2D ICs, we use the same target clock period and logic synthesis output during implementation. In this work, we also propose a pipeline-based partitioning scheme that reduces TSV use and then compare its performance with that of a conventional partitioning method. Our study is based on graphic database system II (GDSII)-level layouts, on the sign-off performance, and on a power analysis for a highly accurate assessment of the issues.
Overall, we summarize our contributions as follows: (1) We show the impact of 3D stacking on power consumption based on a practical implementation of 3D ICs; (2) to show the impact of partitioning on 3D ICs, we present two types of partitioning methods: the proposed pipeline-level partitioning and conventional macro-level partitioning; (3) we provide design considerations for low-power 3D IC design based on a comprehensive analysis and implementation results; and (4) we study the impact of both device and TSV scaling on the power and area benefits of 3D ICs. The remainder of this paper is organized as follows: Section 2 describes the stereo matching algorithm and its hardware architecture, and Section 3 discusses the design environments, design flow, and analysis flow. Section 4 compares the overall layouts and presents the results of the detailed power analysis, and Section 5 discusses the findings, including the impact of switching activity.

Matching Algorithm
Stereo matching involves identifying corresponding pixels in a pair of stereo images and extracting three-dimensional information by computing the disparity between the corresponding pixels through triangulation [1]. It often requires a huge amount of signal processing and a wide memory bandwidth for real-time operation. Over the last decade, a large variety of stereo matching algorithms, which can be grouped into local and global matching algorithms, have been developed [2]. Local algorithms are faster than global algorithms, but the latter tend to be more accurate. Thus, global algorithms are often implemented on special platforms such as graphics processing units because of their substantial computational overhead. Unlike a global algorithm, a local algorithm performs matching operations along the one-dimensional epipolar line by comparing the correlations of windows in a given search range, as shown in Figure 1. As a result, the computational overhead of the local algorithm is lower than that of the global algorithm, so local algorithms are frequently adopted in embedded applications. We designed our stereo matching processors based on a window-based local matching algorithm with a fixed window size because this algorithm maps more straightforwardly onto a pipelined hardware architecture. We use 44 SRAM macros so that the window-based operations can meet the wide-bandwidth and real-time processing requirements of the matching algorithm. The most basic step in the matching algorithm is the computation of the matching cost, which represents the similarity between the reference and candidate windows. It can be computed by several methods, such as the sum of absolute differences and the census transform [16,17].
Several comprehensive studies have compared the matching accuracy of various matching cost computation methods; those that found the rank and census transforms to be inherently robust to radiometric distortions of images observed that the census transform outperforms other window-based stereo matching methods [18,19]. Thus, we adopt the census transform, which represents the characteristic feature of a window as a sequence of bit streams [20,21]. Figure 2 presents a flow diagram of the stereo matching processor, which performs an 11 × 11 window-based census transform on the stereo images to find corresponding pixels in a pair of stereo images and computes the matching cost of each window based on the Hamming distance. After post-processing of the matching results, the stereo matching processor generates a dense depth map.
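The census transform and Hamming-distance matching cost described above can be sketched in plain Python (a behavioral illustration with our own function names and a small 5 × 5 window for brevity, not the processor's RTL):

```python
def census_transform(img, x, y, w=5, h=5):
    """Encode a (w x h) window centered at (x, y) as an integer bit
    string: each bit is 1 if the neighbor is darker than the center."""
    center = img[y][x]
    bits = 0
    for dy in range(-(h // 2), h // 2 + 1):
        for dx in range(-(w // 2), w // 2 + 1):
            if dx == 0 and dy == 0:
                continue  # the center pixel is not compared with itself
            bits = (bits << 1) | (1 if img[y + dy][x + dx] < center else 0)
    return bits

def hamming_cost(code_a, code_b):
    """Matching cost: number of differing bits between two census codes."""
    return bin(code_a ^ code_b).count("1")

def best_disparity(left, right, x, y, max_disp):
    """Local matching: search along the epipolar line (same row) and
    return the disparity with the minimal Hamming cost."""
    ref = census_transform(left, x, y)
    costs = [hamming_cost(ref, census_transform(right, x - d, y))
             for d in range(max_disp + 1)]
    return min(range(len(costs)), key=costs.__getitem__)
```

The hardware evaluates the candidate windows in a pipelined fashion every cycle rather than with a sequential loop, but the cost computation is the same.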

Hardware Architecture
To run the matching algorithm using a pair of stereo images, the window-based stereo matching algorithm requires sufficient memory. Thus, the primary goal of the stereo matching architecture is to efficiently buffer both the left and right images on the memory so that the stereo matching processor can generate a depth map in real time. To fulfill this requirement and to deal with the requirement of a wide bandwidth, we used small, highly partitioned SRAMs and conducted a window-based operation using a finite number of them. Figure 3 illustrates how the window is generated with the finite number of SRAMs and then propagated horizontally. A pair of stereo images are consecutively acquired by a stereo camera, stored in the SRAMs in rows, and processed in columns for window-based matching. As a result, during each cycle, this architecture can concurrently perform multiple reads and a single write operation. Thus, the stereo matching processor can handle the requirements of a wide bandwidth.
As shown in Figure 3, the size of memory is determined primarily by the width of the image and the height of the window. The former equals the depth of each SRAM and the latter the number of SRAMs. As the width of the image increases, therefore, the size of the SRAM increases; as the size of the window increases, the required number of SRAMs increases accordingly. However, a large number of interconnections between the SRAMs and logic cells will cause performance degradation resulting from high routing congestion and the longer wire lengths in 2D ICs. We used an eight-bit, gray-level 752 × 480 image and a 15 × 15 window for the stereo matching and an 11 × 11 window for the post-processing. We chose these window sizes because of their high degree of matching accuracy [22]. Figure 4 shows the fully pipelined hardware architecture of the stereo matching processor, and Table 2 summarizes the features of the stereo matching processors.
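The row-buffering scheme above can be modeled in a few lines of Python (a behavioral sketch with our own class and method names, not the RTL): one row buffer per SRAM, each as deep as the image is wide, so that every cycle performs one write and one read per row buffer, yielding a full window column.

```python
class LineBuffers:
    """Model of the highly partitioned row-buffer scheme: `height` row
    buffers (one per SRAM), each as deep as the image width. Pixels
    stream in raster order; a window column is read as one pixel from
    each row buffer, i.e., `height` concurrent reads plus one write."""
    def __init__(self, width, height):
        self.w, self.h = width, height
        self.bufs = [[None] * width for _ in range(height)]
        self.count = 0  # total pixels written so far

    def push(self, pixel):
        row = (self.count // self.w) % self.h  # which SRAM takes the write
        col = self.count % self.w              # shared column (write) pointer
        self.bufs[row][col] = pixel
        self.count += 1

    def column(self, col):
        """Read one window column: one pixel from each row buffer."""
        return [self.bufs[r][col] for r in range(self.h)]
```

Because each row lives in its own SRAM, the reads in `column` map to physically concurrent SRAM ports, which is how the architecture meets its bandwidth requirement.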

Design Environments
We designed our 2D and 3D ICs using GlobalFoundries 130-nm and Nangate 45-nm process design kits (PDKs). To generate 44 single-port SRAM macros composed of six-transistor bit cells, we used an ARM memory compiler (ARM Inc., San Jose, CA, USA). We chose this commercial technology setting because it was successfully used in the development of a 3D IC built with GlobalFoundries 130-nm process technology and Tezzaron's TSV-based vertical stacking technology [10]. We connected the two tiers of the 3D ICs using via-first TSVs with the face-to-back bonding style for 3D integration, as shown in Figure 5. The diameters of the TSVs were 2.2 µm for the 130-nm and 0.8 µm for the 45-nm designs. Because the capacitance and resistance of TSVs are not negligible in timing and power analyses, we used 10 fF and 2 fF for the TSV capacitance and 50 mΩ and 10 mΩ for the TSV resistance during the timing and power analyses of our 130-nm and 45-nm designs, respectively. We used simulated values for the capacitance and resistance of the TSVs.
Figure 6 presents the design flow for the 2D and 3D IC designs. From the given register transfer level (RTL) description of the stereo matching processor, written in Verilog hardware description language (HDL), we used a conventional design flow for the 2D IC design. We generated top-level synthesized netlists of the 130-nm and 45-nm designs using Synopsys's Design Compiler and the two PDKs. For a fair comparison of the 2D and 3D ICs, we used the same synthesized netlist for both. For the 2D IC layout, we performed floor planning, placement, clock-tree synthesis, routing, and timing optimization using Cadence Encounter. Table 3 summarizes the results of the synthesis.
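The TSV parasitics given above feed directly into the power analysis. As a back-of-the-envelope check, the per-TSV switching power P = α · C · V² · f can be evaluated in a few lines; the TSV capacitances and clock periods are the values used in this work, while the supply voltages and activity factor are illustrative assumptions, not measured values.

```python
def tsv_switch_power(c_farads, vdd, freq_hz, activity=0.15):
    """Average dynamic power of one signal TSV: P = a * C * Vdd^2 * f.
    The 0.15 activity factor is an illustrative assumption."""
    return activity * c_farads * vdd ** 2 * freq_hz

# TSV capacitances (10 fF at 130 nm, 2 fF at 45 nm) and target clock
# periods (3.2 ns, 1.8 ns) are from this work; the supply voltages
# (1.2 V, 1.1 V) are assumed typical values for the respective nodes.
p130 = tsv_switch_power(10e-15, 1.2, 1 / 3.2e-9)
p45 = tsv_switch_power(2e-15, 1.1, 1 / 1.8e-9)
```

In this sketch, even though the 45-nm design runs at a higher clock frequency, the 5× smaller TSV capacitance cuts the per-TSV switching power by more than 3× (roughly 0.67 µW versus 0.20 µW), which suggests why TSV scaling improves the power benefits of 3D stacking.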

Design Flow
For the primary step of the 3D IC design, we used two methods of partitioning. In one method, we separated the gates of the functional modules and the SRAM macros in the top-level netlist, assigned them to the top and bottom tiers, respectively, and then determined the number of TSVs needed to interconnect the tiers. In the other method, we partitioned the top-level synthesized netlist in a pipeline-level style using the "group" and "ungroup" commands in the Synopsys Design Compiler; we then extracted the partitioned netlist for each tier and inserted the TSVs into the netlist. The "ungroup" and "group" commands are used to remove the hierarchy of the top-level synthesized netlist and to create two partitioned netlists for the two tiers of the 3D IC design, respectively. We placed the TSVs prior to gate placement and did the layout separately for each tier in the same way that we performed the conventional 2D IC layout. Table 4 presents a comparison between the TSV usage of the proposed pipeline-level and conventional macro-level partitioning methods. For the macro-level partitioned 3D IC design, we uniformly placed 425 signal TSVs according to the locations of the inputs and outputs of each SRAM macro. In the case of the pipeline-level partitioned 3D IC design, we simply placed 221 signal TSVs in the center area of the top tier because most of them, which came from logic cells on the top tier, connected to memory macros on the bottom tier. We did not optimize the locations of the TSVs, as this was beyond the scope of this study.

Macro-Level Partitioning (MP) Method
We divided the top-level netlist into logic gates and memory macros, as shown in Figure 7a. As the gates are placed vertically over the memory macros, this partitioning method minimizes the wire lengths between the logic and memory macros, thus maximizing the benefits of the 3D ICs. Because the number of TSVs and the die sizes are proportional to the number of macros, as the number of macros increases, the number of TSVs and the sizes of the dies also increase. In this case, if the total area of the macros is not proportional to the total area of the logic cells, maintaining a balance between the utilization ratios of each tier is difficult. Moreover, the larger number of TSVs increases the silicon-area overhead, resulting in routing congestion.
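Under either partitioning, a net consumes one signal TSV exactly when it spans both tiers, so the TSV count is the cut size of the partition. A minimal sketch (the toy netlist, net names, and cell names are ours for illustration):

```python
def count_tsvs(nets, tier_of):
    """A net needs one signal TSV iff its connected cells do not all
    sit on the same tier (i.e., the net is cut by the partition)."""
    return sum(
        1 for cells in nets.values()
        if len({tier_of[c] for c in cells}) > 1
    )

# Toy macro-level partition: logic cells on the top tier, an SRAM macro
# on the bottom tier, so every logic-to-macro net crosses the boundary.
nets = {
    "addr0": ["ctrl", "sram0"],   # cut: needs a TSV
    "din0":  ["pipe2", "sram0"],  # cut: needs a TSV
    "local": ["ctrl", "pipe2"],   # stays on the top tier: no TSV
}
tier_of = {"ctrl": "top", "pipe2": "top", "sram0": "bottom"}
```

With every macro on the bottom tier, each macro's address, data, and control nets are all cut, which is why the macro-level method's TSV count grows with the number of macros.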

Pipeline-Level Partitioning (PP) Method
The purpose of this partitioning method is to minimize the number of TSVs and to balance the die sizes of the tiers with minimal effort in a pipelined hardware architecture. In this partitioning method, we adhere to the basic concept of pipelining (in which memory is dedicated to its own pipeline stage, as shown in Figure 4) by simply splitting the pipeline stages into two groups, assigning the first three pipeline stages (1, 2, and 3) to the top tier and the remaining pipeline stages (4, 5, and 6) to the bottom tier. In this case, the number of TSVs is determined by the number of signals between the two groups. However, simply dividing the pipeline stages into two groups does not guarantee balanced die sizes. Therefore, we balanced the tiers by moving memory macros from the top to the bottom tier; the number of TSVs increased as the number of adjusted memory macros increased. Figure 8 illustrates the two steps of the pipeline-level partitioning method. In this example, we assumed that the silicon areas of each stage were identical.
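The balancing step above can be sketched numerically. Using the figures reported later in this paper (a 67.5%/32.5% initial cell-area split and roughly 1.4% of the cell area per SRAM macro), here scaled to integer tenths of a percent to avoid floating-point noise; the helper function name is ours:

```python
def balance_by_macros(top_area, bottom_area, macro_area, n_movable):
    """Step 2 of pipeline-level partitioning: move SRAM macros from the
    heavier top tier to the bottom tier until the cell-area gap is no
    larger than one macro. Returns (macros_moved, top, bottom)."""
    moved = 0
    while moved < n_movable and top_area - bottom_area > macro_area:
        top_area -= macro_area
        bottom_area += macro_area
        moved += 1
    return moved, top_area, bottom_area

# Areas in tenths of a percent of total cell area: 67.5% vs. 32.5%,
# each of the 44 SRAM macros occupying about 1.4%.
moved, top, bottom = balance_by_macros(675, 325, 14, 44)
```

Under these numbers the sketch moves 12 macros, consistent with the 12 SRAM macros moved in our design, leaving about 50.7% of the cell area on the top tier and 49.3% on the bottom tier.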

Comparison of the Partitioning Methods
For the 3D IC design, we used both the conventional macro-level and the proposed pipeline-level partitioning methods. The 3D IC with the macro-level partitioning method (3D-MP), which consists of a logic tier (top tier) and a memory macro tier (bottom tier), minimizes the wire lengths between the logic and memory macros, as shown in Figure 7a. For the 3D-MP design, we uniformly placed 425 signal TSVs on the top tier according to the locations of the input and output ports of each memory macro. For the 3D IC design with the pipeline-level partitioning method (3D-PP), we split the pipeline stages to minimize the number of TSVs and then balanced the cell area of each tier by adjusting the number of memory macros on each tier. We assigned the first three pipeline stages (1, 2, and 3) of Figure 4 to the top tier and the remaining pipeline stages (4, 5, and 6) to the bottom tier. In this case, before adjusting the number of memory macros on each tier, 67.5% of the cell area was assigned to the top tier and 32.5% to the bottom tier; thus, with the same footprint size, the top tier would suffer more from routing congestion. From the cell area report, shown in Table 5, we learned that each SRAM macro occupies about 1.4% of the cell area, so to balance the cell area of each tier, we moved 12 SRAM macros from the top to the bottom tier. For the 3D-PP design, we placed the cells of pipeline stages 3 and 4 in the central parts of the top and bottom tiers and connected them using TSVs located in the central part of the layout, mainly because the cells of pipeline stages 1 and 6 are placed on the boundary of the die for the pad connections. In addition, since finding the optimal locations of TSVs for 3D ICs was not the goal of this study, we placed the TSVs in the center area of the chip for the 3D-PP design.

Timing and Power Analysis Flow
We conducted a static timing analysis (STA) using Synopsys PrimeTime with the layout netlist and an RC parasitic file that contained the resistance and capacitance values of all of the nets; if the timing was met, we then performed a power analysis, also using PrimeTime. Figure 9 presents the flow of the power and timing analyses, which we performed with the layout netlist and the RC parasitic file of each tier. Although existing commercial tools can perform timing and power analyses for 2D IC designs, they cannot handle 3D IC designs. Thus, for the 3D IC analysis, we created a top-level netlist by combining the netlists of the tiers and a top-level RC parasitic file for the TSVs. We then merged the three parasitic files (two from the two dies and one from the TSVs) into one and used it to perform the 3D timing analysis. We also used the combined parasitic files to obtain timing constraints at the die boundary and performed timing optimization. If the timing was met, we conducted a power analysis. Our power comparisons were made at iso-performance: we used the same target clock period during layout and timing optimization for both the 2D and 3D designs, 3.2 ns for the 130-nm and 1.8 ns for the 45-nm 2D and 3D ICs. Clock-tree synthesis for the 3D ICs was difficult because no commercial EDA tools can fully handle clock trees for 3D ICs. Thus, we treated each tier as if it had its own clock-tree network, performed clock-tree synthesis separately for each tier, and then directly connected the clock sources of the top and bottom tiers through a TSV.
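The merged tier-plus-TSV parasitics described above are what the delay calculation operates on. As an illustration, the Elmore delay of a single net crossing the tier boundary can be modeled as an RC ladder; the TSV values (50 mΩ, 10 fF) are the 130-nm figures used in this work, while the driver, wire, and load values are illustrative assumptions:

```python
def elmore_delay(stages):
    """Elmore delay of an RC ladder: each stage is (R_ohms, C_farads),
    and each resistance charges all downstream capacitance."""
    delay, downstream = 0.0, sum(c for _, c in stages)
    for r, c in stages:
        delay += r * downstream
        downstream -= c
    return delay

# A cross-tier net at 130 nm, driver to receiver through one TSV.
net = [
    (200.0, 5e-15),   # driver resistance and output cap (assumed)
    (150.0, 20e-15),  # top-tier wire segment (assumed)
    (0.05, 10e-15),   # TSV: 50 mOhm / 10 fF (values used in this work)
    (150.0, 20e-15),  # bottom-tier wire segment (assumed)
    (0.0, 8e-15),     # receiver input cap (assumed)
]
```

In this sketch, the TSV's 50 mΩ resistance contributes a negligible fraction of the ~25 ps total, whereas its 10 fF must be charged through the driver and top-tier wire, adding several picoseconds; this is why the TSV capacitance, rather than its resistance, dominates its timing and power impact.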

Overall Layout Comparisons
Figure 10 presents comparisons of the normalized design quality of the 2D and 3D ICs, whose layout snapshots, designed in 130-nm and 45-nm technologies, are shown in Figures 11 and 12, respectively; the overall layout results are summarized in Tables 6 and 7. We first observe that the 3D ICs deliver significant improvements over the 2D ICs in the overall metrics: footprint, power, and wire length. The chip footprints of both the 130-nm and 45-nm 3D ICs are as much as 43% smaller than those of the 2D ICs. In the case of the 130-nm 3D ICs (3D-MP-130 and 3D-PP-130), the total wire lengths are 14% and 4% shorter, respectively, than that of the 2D IC (2D-130) because of vertical stacking and the smaller footprints. We also observe that the total numbers of buffers in 3D-MP-130 and 3D-PP-130 are nearly 19% and 18% smaller than that of 2D-130, mainly because of the shortened wire lengths. As a result, 3D-MP-130 and 3D-PP-130 consume 13% and 7% less power, respectively, than 2D-130. Similarly, the total wire lengths of the 45-nm 3D ICs (3D-MP-45 and 3D-PP-45) are 11% and 3% shorter than that of the 2D IC (2D-45); their numbers of buffers are 30% and 35% smaller than that of 2D-45; and their power consumption is 8% and 7% less than that of 2D-45, respectively.
In contrast to the comparisons of footprints, wire lengths, buffers, and power, the clock-tree analysis did not reveal any benefit from 3D stacking. In the case of 3D-PP-130, although the wire length of the clock tree was 7% shorter than that of 2D-130, its number of clock-tree buffers was 4% higher than that of the 2D IC. Similarly, the numbers of clock-tree buffers of 3D-MP-45 and 3D-PP-45 were 1% and 10% higher, respectively, than that of 2D-45. One explanation for this finding is that we performed clock-tree synthesis separately for the top and bottom tiers and then directly connected the clock source of the bottom tier to that of the top tier through a TSV. In doing so, we used existing commercial EDA tools, with each synthesis run unaware of the other tier's clock tree, so the clock tree of the 3D IC was not well optimized. However, because 3D-MP-130 and 3D-MP-45 did not have a large clock-tree network on the bottom tier, they suffered less from this clock-tree optimization problem.
Compared with 3D-PP, 3D-MP uses a larger number of TSVs but outperforms 3D-PP, particularly regarding the wire length and the number of clock-tree buffers. This finding could result from several factors: (1) the placement of TSVs in 3D-PP may not be optimal; or (2) the clock-tree network of 3D-PP is not optimized, so it has more clock buffers than that of 3D-MP. From these observations, we learn that the optimal placement of TSVs and the synthesis of the 3D clock tree play important roles in 3D IC design.

Detailed Power Analysis
To study the sources of the power reduction in 3D ICs, we performed various power analyses. First, as shown in Figures 13 and 14, we observe that most of the power savings come from the combinational logic and net switching power, indicating that the reduced number of buffers in the interconnections plays an important role in reducing the power consumption of 3D IC designs. In addition, in contrast to the 3D ICs designed in 130-nm technology, which have negligible cell leakage power, the cell leakage power of the 45-nm 3D ICs is as much as 7% less than that of the 2D IC. This is mainly because the numbers of buffers in the 45-nm 3D ICs (3D-MP-45 and 3D-PP-45) are 30% and 35% smaller than that of the 2D IC (2D-45), which results from vertical stacking and the reduced wire lengths. This result also indicates that leakage becomes a larger contributor to total power as process technology scales down: leakage power in 130-nm technology is negligible, whereas its proportion in 45-nm technology is considerably higher.

Tables 8 and 9 show how power consumption breaks down into cell internal, net switching, and cell leakage power across power groups. From the tables, we observe that the memory macros consume around 38% and 32% of the power in the 130-nm and 45-nm process technologies, respectively, because of the memory-intensive nature of the stereo matching processor. We also observe that the clock network consumes a relatively large portion of the total power because of its high switching activity. In the case of a 3D IC designed in 130-nm technology, for example, the clock network consumes over 30% of the power even though the clock tree comprises only around 4% of the total wire length and 5% of all buffers.
Comparing the 130-nm and 45-nm 3D ICs, we also observe that as the technology scales down, the proportion of switching power increases while the proportion of internal power decreases, indicating that reductions in switching power account for an increasingly large share of the power savings.
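The group-level shares discussed above are simple ratios over a per-group power report. A minimal sketch, with hypothetical (internal, switching, leakage) entries chosen only to echo the proportions described in the text:

```python
# Sketch of the power-group breakdown in the spirit of Tables 8 and 9.
# Each entry is (internal, switching, leakage) power in mW; the values
# are illustrative, not measurements from the paper.

groups = {
    "memory":        (30.0, 7.0, 0.5),
    "clock_network": (18.0, 12.0, 0.2),
    "combinational": (10.0, 14.0, 0.8),
    "register":      (5.0, 2.0, 0.5),
}

# Total power over all groups and all three components.
total = sum(sum(v) for v in groups.values())

def share(group):
    """Fraction of total power consumed by one power group."""
    return sum(groups[group]) / total

for name in groups:
    print(f"{name}: {100 * share(name):.1f}% of total power")
```

With these made-up numbers the memory group lands near 38% and the clock network above 30%, mirroring the 130-nm observations.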

Impact of Switching Activity
To investigate the impact of switching activity on the power benefits of 3D ICs, we performed a detailed power analysis while varying the switching activity. We varied the target static probability of the switching activity on the primary inputs from 0.1 to 0.5 and used PrimeTime to propagate the switching activity to the rest of the circuit. The varying switching activity thus represents different workloads of the stereo matching processor. Figures 15 and 16 present comparisons of the normalized power of the 2D and 3D ICs as a function of the switching activity. We first observe that the overall power savings of 3D-MP-130, 3D-MP-45, and 3D-PP-45 over their 2D IC counterparts increase with the switching activity. This increase stems from the power reduction delivered mainly by the shortened wire lengths of the 3D ICs and indicates that, as switching activity increases, the power benefits of 3D ICs become more significant. In the case of 3D-PP-130, however, the power savings over 2D-130 do not change, because neither the clock-tree wire length nor the number of clock buffers of 3D-PP-130 decreased; this confirms that as the switching activity increases, the clock-tree network becomes more important for power savings.
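The trend above can be reproduced with a toy power model in which an activity-independent term (leakage plus the free-running clock network) is added to a data-switching term that scales with the input static probability. The numbers below are illustrative, not measured values:

```python
# Toy model of why the 3D power savings grow with switching activity.

def total_power(p_static, p_switch_ref, activity, activity_ref=0.5):
    """Power at a given input activity.

    p_static is the activity-independent part (leakage + clock network);
    p_switch_ref is the data-switching power measured at activity_ref.
    """
    return p_static + p_switch_ref * (activity / activity_ref)

def savings(activity):
    """Fractional power saving of a 3D design whose nets switch less."""
    p2d = total_power(40.0, 60.0, activity)  # 2D: larger switching part
    p3d = total_power(40.0, 48.0, activity)  # 3D: ~20% less switching power
    return 1.0 - p3d / p2d

for a in (0.1, 0.3, 0.5):
    print(f"activity {a}: {100 * savings(a):.1f}% savings")
```

Because the fixed term is shared while only the switching term shrinks in 3D, the relative saving grows monotonically with activity, matching the behavior of 3D-MP-130, 3D-MP-45, and 3D-PP-45; a design whose fixed clock-tree term did not shrink, like 3D-PP-130, shows a flat curve.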

Comparisons of the Results with the Related Studies
We compared our reductions in power and wire length with those of the related studies summarized in Table 1. Because of the different case studies and goals, however, only a few studies compared a 3D IC with its 2D IC counterpart. Table 10 summarizes the comparisons of the power and wire length reductions. Ouyang et al. focused on delay reduction in 3D-stacked arithmetic units [7]. Kim et al. [10], Zhang et al. [11], and Saito et al. [12] focused on the feasibility of 3D-stacked IC designs. Thorolfsson et al. [8], Neela et al. [9], Franzon et al. [13], and Oh et al. [14] compared 3D ICs with their 2D IC counterparts. Thorolfsson et al. [8] and Franzon et al. [13] presented a 3D-stacked FFT processor composed of memory on logic and compared it with its 2D IC counterpart; they achieved 4.4% and 56.9% reductions in power and wire length, respectively, using MIT Lincoln Labs' manufacturing process. Neela et al. implemented a 3D-stacked single-precision floating-point unit; however, it consumes more power than its 2D IC counterpart, mainly because of the increased wire length required to route signals to the microbumps [9]. Oh et al. [14] demonstrated a 3D-stacked ternary content-addressable memory and achieved a 21.5% reduction in power.


Conclusions
We described the benefits of TSV-based 3D stacking and the impact of device technology scaling on the performance of the stereo matching processor, and presented comprehensive comparisons of 2D and 3D-stacked stereo matching processors designed in 130-nm and 45-nm technologies. In addition, we presented the overall RTL-to-GDSII design flow and described the power and timing analysis flow for a TSV-based 3D IC design. From the experimental results, we observed that the reduced power consumption of our 3D-stacked stereo matching processors is mainly due to the shortened wire lengths and reduced buffer use resulting from 3D stacking. In the case of the 3D ICs designed in 130-nm process technology, our pipeline-level partitioning method reduced the total power by 7% and our macro-level method by 13%. Likewise, in the case of the 3D ICs designed in 45-nm process technology, the pipeline-level and macro-level partitioning methods achieved 7% and 8% power savings, respectively.
This paper demonstrated significant power reductions with our 3D-stacked stereo matching processors designed in both the 130-nm and 45-nm process technologies. The experimental results also showed that as the switching activity increases, the clock-tree network and its buffers account for an increasingly important share of the power savings of 3D ICs. In addition, we showed that the proposed pipeline-level partitioning method minimizes TSV usage while balancing the area of each tier. Reducing the number of signal TSVs, however, does not always lead to an optimal design with regard to total wire length and power; therefore, to fully exploit the benefits of vertical stacking, designers should optimally place TSVs in their physical layouts.