A Uniform Architecture Design for Accelerating 2D and 3D CNNs on FPGAs

Three-dimensional convolutional neural networks (3D CNNs) have gained popularity in many complicated computer vision applications. Many customized FPGA-based accelerators have been proposed for 2D CNNs, while very few target 3D CNNs. 3D CNNs are far more computationally intensive, and the additional dimension further expands the design space for acceleration, which makes accelerating 3D CNNs on FPGAs a big challenge. Motivated by the finding that the computation patterns of 2D and 3D CNNs are very similar, we propose a uniform architecture design for accelerating both 2D and 3D CNNs in this paper. The uniform architecture is based on the idea of mapping convolutions to matrix multiplications. A customized mapping module is developed to generate the feature matrix tilings with no need to store the entire enlarged feature matrix on-chip or off-chip; a splitting strategy is adopted to reconstruct a convolutional layer to adapt to the on-chip memory capacity; and a 2D multiply-and-accumulate (MAC) array is adopted to compute matrix multiplications efficiently. For demonstration, we implement an accelerator prototype with a high-level synthesis (HLS) methodology on a Xilinx VC709 board and test the accelerator on three typical CNN models: AlexNet, VGG16, and C3D. Experimental results show that the accelerator achieves state-of-the-art throughput performance on both 2D and 3D CNNs, with much better energy efficiency than the CPU and GPU.


Introduction
In recent years, convolutional neural networks (CNNs) have gained great success in various computer vision applications such as image classification [1], object detection [2], and face recognition [3]. CNNs have been primarily applied to 2D images to automatically extract spatial features and have significantly enhanced image classification accuracy. To effectively incorporate motion information in video analysis, 3D CNNs with spatiotemporal convolutional kernels have been proposed. Owing to their ability to capture both spatial and temporal features, 3D CNNs have proved to be very effective in many video-based applications including object recognition [4], hand gesture recognition [5], and human action recognition [6].
CNNs require vast amounts of memory as there are millions of parameters in a typical CNN model. Meanwhile, CNNs are computationally intensive, with billions of operations for the inference of one input. For example, VGG16 [7], a real-life 2D CNN model for image classification with 16 layers, takes around 31 GOPs for the inference of one image. C3D [6], a real-life 3D CNN model for […] memory consumption. The second challenge is that the weight matrix and feature matrix are enlarged by a factor of the kernel temporal depth in 3D convolutions compared to 2D CNNs. Accordingly, the memory consumption rises by a factor of the kernel temporal depth when the weight matrix and feature matrix are stored on-chip. To guarantee that the uniform architecture can be applied to large CNN models and deployed on platforms with limited on-chip memory capacity, we adopt an effective splitting strategy. A convolutional layer with a large number of input channels is split into multiple convolutional layers with a smaller number of input channels. The third challenge is how to compute matrix multiplications efficiently on FPGAs. Different from the OpenCL-based computation framework in [18,19], we adopt a 2D MAC array for matrix multiplications. The 2D MAC array is scalable and its size is mainly determined by the hardware resources, the memory bandwidth, and the size of the feature maps.
To summarize, our key contributions are as follows:
• We propose a uniform accelerator architecture design supporting both 2D and 3D CNNs, based on the idea of mapping convolutions to matrix multiplication operations. Special efforts are made on memory optimizations and computations to enhance throughput performance;
• We analytically model the resource utilization and throughput performance of our architecture, which helps to configure an accelerator on a specific platform within certain constraints including hardware performance, memory bandwidth, and clock frequency;
• We demonstrate the architecture design by implementing an accelerator on the Xilinx VC709 board with the high-level synthesis (HLS) methodology. Three typical CNN models, AlexNet, VGG16, and C3D, are tested on the accelerator. Experimental results show that the accelerator achieves over 850 GOP/s for convolutional layers and nearly 700 GOP/s overall on VGG16 and C3D, with much better energy efficiency than the CPU and GPU.
The rest of the paper is organized as follows: Section 2 briefly introduces the basic background of CNNs and the design directions of the accelerator architecture; Section 3 presents the architecture design and the main components; Section 4 provides the implementation and optimization details; Section 5 presents the accelerator modeling; Section 6 reports the experimental results; and finally, Section 7 concludes the paper.

CNN Basics and Accelerator Design Directions
In this section, we briefly review the operations in the convolutional layers of 2D and 3D CNNs and then introduce the accelerator design directions on FPGAs. The fully-connected, activation, and pooling layers are omitted due to their relative simplicity.

Operations in Convolutional Layers of 2D CNNs
Figure 1a illustrates the process of a 2D convolution. A 2D convolution applied to an image results in another image. In convolutional layers of 2D CNNs, the input and output features are images with multiple channels. For example, the input feature of the first convolutional layer has three channels: red, green, and blue. A 2D convolution is applied to each channel of the input feature and the generated images are then accumulated, resulting in one channel of the output feature, as shown in Figure 1b. For simplicity, we use X to indicate the input feature with a size of c × h × w and Y to indicate the output feature with a size of m × h × w. Here, c and m are the numbers of input and output channels, and h and w are the height and width of each feature. We use W to indicate the weights, which contain m × c kernels with a size of k × k. Suppose the sliding stride is 1, then each pixel Y[mm][hh][ww] in the output feature is calculated by:
$$Y[mm][hh][ww] = \sum_{cc=0}^{c-1} \sum_{kh=0}^{k-1} \sum_{kw=0}^{k-1} W[mm][cc][kh][kw] \times X[cc][hh+kh][ww+kw] \tag{1}$$

Operations in Convolutional Layers of 3D CNNs
Figure 1c illustrates the process of a 3D convolution. A 3D convolution applied to a video volume results in another volume. Similarly, in convolutional layers of 3D CNNs, the input and output features are video volumes with multiple channels. A 3D convolution is applied to each channel of the input feature and the generated volumes are then accumulated, resulting in one channel of the output feature, as shown in Figure 1d. The input and output features are indicated by X with a size of c × l × h × w and Y with a size of m × l × h × w, where l is the number of frames while the other variables have the same meaning as above. The weights W contain a total of m × c kernels with a size of d × k × k, where d is the kernel temporal depth and k is the kernel spatial size. Suppose the sliding stride is 1, then each pixel Y[mm][ll][hh][ww] in the output feature is given by:
$$Y[mm][ll][hh][ww] = \sum_{cc=0}^{c-1} \sum_{dd=0}^{d-1} \sum_{kh=0}^{k-1} \sum_{kw=0}^{k-1} W[mm][cc][dd][kh][kw] \times X[cc][ll+dd][hh+kh][ww+kw] \tag{2}$$
Compared to the convolutional layers in 2D CNNs, there is an additional accumulation along the temporal dimension, as shown in Equation (2). By switching the order of the two outer accumulations, we get Equation (3):
$$Y[mm][ll][hh][ww] = \sum_{dd=0}^{d-1} \sum_{cc=0}^{c-1} \sum_{kh=0}^{k-1} \sum_{kw=0}^{k-1} W[mm][cc][dd][kh][kw] \times X[cc][ll+dd][hh+kh][ww+kw] \tag{3}$$
The inner three accumulations in Equation (3) are very similar to Equation (1), since the loop variable along the temporal dimension, dd, is fixed.
We can further combine the outer two loops and hence get Equation (4), which is almost the same as Equation (1) except that the number of input channels is enlarged by a factor of d.That is to say, 3D convolutions can be computed in the same way as 2D convolutions.
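The equivalence stated above can be checked numerically. The following is a minimal sketch, not the accelerator's implementation: it assumes stride 1 and no padding (so the outputs are smaller than the inputs), and it assumes the two outer loops are merged into a single channel index dd × c + cc. The function names and the test sizes are illustrative only.

```python
def conv2d(X, W):
    """2D convolution in the style of Equation (1): stride 1, no padding."""
    c, h, w = len(X), len(X[0]), len(X[0][0])
    m, k = len(W), len(W[0][0])
    out_h, out_w = h - k + 1, w - k + 1
    Y = [[[0] * out_w for _ in range(out_h)] for _ in range(m)]
    for mm in range(m):
        for hh in range(out_h):
            for ww in range(out_w):
                Y[mm][hh][ww] = sum(
                    W[mm][cc][kh][kw] * X[cc][hh + kh][ww + kw]
                    for cc in range(c) for kh in range(k) for kw in range(k))
    return Y


def conv3d(X, W):
    """3D convolution in the style of Equation (2): stride 1, no padding."""
    c, l, h, w = len(X), len(X[0]), len(X[0][0]), len(X[0][0][0])
    m, d, k = len(W), len(W[0][0]), len(W[0][0][0])
    out_l, out_h, out_w = l - d + 1, h - k + 1, w - k + 1
    Y = [[[[0] * out_w for _ in range(out_h)] for _ in range(out_l)]
         for _ in range(m)]
    for mm in range(m):
        for ll in range(out_l):
            for hh in range(out_h):
                for ww in range(out_w):
                    Y[mm][ll][hh][ww] = sum(
                        W[mm][cc][dd][kh][kw] * X[cc][ll + dd][hh + kh][ww + kw]
                        for cc in range(c) for dd in range(d)
                        for kh in range(k) for kw in range(k))
    return Y


def conv3d_via_conv2d(X, W):
    """Compute the same 3D convolution as 2D convolutions with d * c channels."""
    c, l = len(X), len(X[0])
    m, d = len(W), len(W[0][0])
    # Merged weights: 2D "channel" index dd * c + cc (assumed merge order).
    W2 = [[W[mm][cc][dd] for dd in range(d) for cc in range(c)]
          for mm in range(m)]
    frames = []
    for ll in range(l - d + 1):
        # The d frames covered by the temporal window become extra channels.
        X2 = [X[cc][ll + dd] for dd in range(d) for cc in range(c)]
        frames.append(conv2d(X2, W2))
    # Reorder from [ll][mm][hh][ww] to [mm][ll][hh][ww].
    return [[frames[ll][mm] for ll in range(len(frames))] for mm in range(m)]


# Small deterministic example (sizes are arbitrary illustration values).
c, l, h, w, m, d, k = 2, 4, 5, 5, 3, 3, 3
X = [[[[(7 * cc + 5 * ll + 3 * hh + ww) % 11 - 5 for ww in range(w)]
       for hh in range(h)] for ll in range(l)] for cc in range(c)]
W = [[[[[(mm + 2 * cc + 3 * dd + 5 * kh + 7 * kw) % 5 - 2 for kw in range(k)]
        for kh in range(k)] for dd in range(d)] for cc in range(c)]
     for mm in range(m)]
Y_direct = conv3d(X, W)
Y_merged = conv3d_via_conv2d(X, W)
```

Because all values are integers, the two results can be compared for exact equality.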

Convolution as Matrix Multiplication
2D convolutions can be mapped to matrix multiplication operations by flattening and rearranging the weights and input features. As illustrated in Figure 2, 64 × 3 kernels with a size of 3 × 3 are mapped to a rearranged matrix with dimensions of 64 × (3 × 3 × 3). All three kernels belonging to the same group are flattened horizontally to form a row of the weight matrix. Meanwhile, all three input features with dimensions of 32 × 32 are mapped to a rearranged matrix with dimensions of (3 × 3 × 3) × (32 × 32). All the pixels covered by the first convolution window in each channel are flattened vertically to form the first column of the feature matrix. The entire feature matrix can be generated by sliding the convolution window across the features along the column and row directions. After rearrangement, the convolutions are transformed to a general matrix multiplication. The result of the matrix multiplication is an output matrix with dimensions of 64 × (32 × 32), which is the flattened format of the output features. Notice that the number of pixels in the feature matrix is k × k times that in the input feature, as each pixel is covered by the convolution window k × k times while the window slides across the features and is hence replicated k × k times. Similarly, 3D convolutions can also be mapped to matrix multiplications with the same method. Compared to 2D convolutions, the convolution window slides across the input features not only along the row and column directions, but also along the temporal direction. Consequently, each pixel is covered by the convolution window d × k × k times and hence replicated d × k × k times, so the number of pixels in the feature matrix is d × k × k times that in the input feature.
Here we can see that mapping convolutions to matrix multiplication operations introduces a high degree of data replication. In CPU and GPU implementations, the entire feature matrix is generated before computing the matrix multiplication. This is not a big problem for CPUs and GPUs as they have abundant memory space and bandwidth. However, for FPGAs with limited on-chip memory capacity and off-chip memory bandwidth, storing the entire feature matrix can be a critical limitation. We will show how we optimize this approach in FPGA implementations with a customized matrix mapping module in the next section.
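To make the rearrangement concrete, the following sketch builds the full feature matrix in software, which is the CPU/GPU-style approach described above and not the on-chip mapping module. It assumes stride 1 and zero padding of `pad` pixels on each side; the helper names `im2col`, `flatten_weights`, and `matmul` are illustrative.

```python
def im2col(X, k, pad):
    """Flatten a c x h x w input into a (c*k*k) x (out_h*out_w) feature matrix."""
    c, h, w = len(X), len(X[0]), len(X[0][0])
    out_h, out_w = h + 2 * pad - k + 1, w + 2 * pad - k + 1

    def pixel(cc, y, x):
        # Positions outside the input are treated as zero padding.
        return X[cc][y][x] if 0 <= y < h and 0 <= x < w else 0

    # One row per (channel, kernel row, kernel column); one column per window.
    return [[pixel(cc, hh + kh - pad, ww + kw - pad)
             for hh in range(out_h) for ww in range(out_w)]
            for cc in range(c) for kh in range(k) for kw in range(k)]


def flatten_weights(W):
    """Flatten m x c x k x k kernels into an m x (c*k*k) weight matrix."""
    return [[v for kernel in group for row in kernel for v in row]
            for group in W]


def matmul(A, B):
    """Plain matrix product of two lists of rows."""
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]


# Dimensions of the Figure 2 example: 3 channels of 32 x 32, 3 x 3 kernels.
X_fig = [[[(3 * cc + 2 * y + x) % 7 for x in range(32)] for y in range(32)]
         for cc in range(3)]
feature_matrix = im2col(X_fig, k=3, pad=1)
replication = (len(feature_matrix) * len(feature_matrix[0])) / (3 * 32 * 32)
```

For these dimensions the feature matrix has 27 rows and 1024 columns, and `replication` evaluates to k × k = 9, which is the storage overhead the matrix mapping module is designed to avoid.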

Splitting Strategy
As shown in Figure 2, all channels of the input feature are involved in generating one column of the feature matrix. When loading the required pixels of the input feature from off-chip memory to on-chip memory, the burst access pattern is typically adopted to raise the memory bandwidth utilization rate. That is to say, we have to store at least several rows for each input channel of the input feature on-chip. In a convolutional layer of a large-scale 2D CNN model, the number of input channels may be very large, 512 or even 1024, which consumes a great deal of on-chip memory space. This can be even more severe for a 3D CNN model as the number of input channels is enlarged by a factor of d. Considering that the target hardware platform may have very limited on-chip memory, we adopt a splitting strategy that splits a convolutional layer with a large number of input channels into multiple convolutional layers with a smaller number of input channels. An independent sum layer is introduced to accumulate the results of two convolutional layers. Suppose the on-chip memory can store at most ic_max input channels, then a convolutional layer with c input channels will be split into c/ic_max convolutional layers with ic_max input channels each. Additionally, c/ic_max − 1 sum layers will be introduced to accumulate the partial results. The splitting strategy is in essence a kind of matrix blocking or partitioning method. Unlike block matrix multiplication, an independent sum layer is introduced, which eliminates the need to store intermediate results on-chip and hence saves on-chip memory.
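The splitting strategy can be illustrated in software as follows. This is a sketch under stated assumptions: stride 1 with no padding, and a ceiling division so that a channel count that is not a multiple of ic_max leaves a smaller final slice (the text above only covers the evenly divisible case). The names `split_conv2d` and `add_features` are illustrative.

```python
import math


def conv2d(X, W):
    """2D convolution, stride 1, no padding."""
    c, h, w = len(X), len(X[0]), len(X[0][0])
    m, k = len(W), len(W[0][0])
    return [[[sum(W[mm][cc][kh][kw] * X[cc][hh + kh][ww + kw]
                  for cc in range(c) for kh in range(k) for kw in range(k))
              for ww in range(w - k + 1)]
             for hh in range(h - k + 1)]
            for mm in range(m)]


def add_features(A, B):
    """Sum layer: element-wise accumulation of two output features."""
    return [[[a + b for a, b in zip(ra, rb)] for ra, rb in zip(ca, cb)]
            for ca, cb in zip(A, B)]


def split_conv2d(X, W, ic_max):
    """Run one convolutional layer as several layers of at most ic_max channels."""
    c = len(X)
    n_conv = math.ceil(c / ic_max)
    total = None
    for p in range(n_conv):
        lo, hi = p * ic_max, min((p + 1) * ic_max, c)
        part = conv2d(X[lo:hi], [group[lo:hi] for group in W])
        total = part if total is None else add_features(total, part)
    n_sum = n_conv - 1  # one sum layer per additional partial result
    return total, n_conv, n_sum


# Arbitrary small example: 6 input channels, at most 2 channels on-chip.
X = [[[(5 * cc + 3 * y + x) % 9 - 4 for x in range(4)] for y in range(4)]
     for cc in range(6)]
W = [[[[(mm + cc + 2 * kh + kw) % 3 - 1 for kw in range(3)] for kh in range(3)]
      for cc in range(6)] for mm in range(2)]
Y_split, n_conv, n_sum = split_conv2d(X, W, ic_max=2)
```

Here the layer is executed as three convolutional layers and two sum layers, and the accumulated result matches the unsplit convolution.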

Hardware Architecture Design
We propose a uniform architecture design for accelerating both 2D and 3D CNNs based on the idea of mapping convolutions to matrix multiplication operations. The architecture can adapt to convolutional layers with different input dimensions and kernel sizes, and can support large-scale CNN models owing to the splitting strategy. In this section, the details of the customized matrix mapping module, the computation framework, the buffers, and the whole architecture will be presented.

Matrix Mapping Module
As introduced above, mapping convolutions to matrix multiplication operations introduces a high degree of data replication, which raises the memory access overhead when the entire feature matrix is stored off-chip. We propose a customized matrix mapping module that avoids data replication by reusing the overlapped data during the sliding of convolution windows.
Figure 3a-c illustrate how the convolution window slides across the columns of the input feature. The pixel in green is involved in k convolutions. Accordingly, it appears in the feature matrix k times in k consecutive rows. We can generate k rows of data in the feature matrix by simply shifting the related row in the input feature k times, as illustrated in Figure 3d. Figure 3e shows the status after the convolution window slides vertically across the rows. There are k − 1 overlapped rows between two consecutive slides across the rows, which can be reused when the convolution window slides across the columns. Therefore, each pixel is used k × k times during the process.
To save on-chip memory, we store only k + stride rows for each channel of the input feature (stride = 1 in most cases) instead of the entire input feature. The first k rows hold the active data involved in the current convolutions, while the remaining stride rows are pre-cached for the next slide across the rows. We partition the input feature along the column dimension to provide enough data ports for data shifting, as shown in Figure 3f. The matrix mapping module loads k rows of each channel at the beginning, shifts each row k times to generate c × k × k rows, and writes the feature matrix block to the feature FIFOs, as illustrated in Figure 3g. Then the matrix multiplication starts and calculates the first row of the output feature. During the matrix multiplication, the next stride rows in each channel of the input feature are loaded and replace stride rows of data in the same channel according to the First-In-First-Out policy. The process repeats until all the pixels in the output feature are calculated.
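The row-shifting step can be sketched in software as below. This is an illustration of the idea rather than the hardware module: it assumes stride 1, zero padding of `pad` columns at each edge, and that `line_rows[cc]` already holds the k active rows of channel cc. The function name `map_output_row` is hypothetical.

```python
def map_output_row(line_rows, k, pad):
    """Build the (c*k*k) x w feature matrix block for one output row.

    line_rows[cc][r] is the r-th of the k active rows (length w) of channel cc.
    Each stored row is shifted k times instead of being replicated in memory.
    """
    block = []
    for rows in line_rows:                      # input channels
        for r in range(k):                      # kernel row
            w = len(rows[r])
            padded = [0] * pad + list(rows[r]) + [0] * pad
            for shift in range(k):              # kernel column = shift amount
                block.append(padded[shift:shift + w])
    return block


# One channel, three active rows of width 4, 3 x 3 kernel with padding 1.
rows = [[[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]]]
block = map_output_row(rows, k=3, pad=1)
```

The block has c × k × k = 9 rows, and the first stored row yields `[0, 1, 2, 3]`, `[1, 2, 3, 4]`, and `[2, 3, 4, 0]` for the three shifts, so only k rows per channel need to be held on-chip to produce one output row.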

2D MAC Array
If A and B are matrices with dimensions of M × N and N × L respectively, then the product matrix C is given by:
$$C[i][j] = \sum_{n=0}^{N-1} A[i][n] \times B[n][j] \tag{5}$$
We adopt a 2D MAC array to compute matrix multiplication in the most straightforward way. A MAC unit is composed of a multiplier to calculate the products, an adder to accumulate the products, and a register to keep the partial sum. The MAC unit located at (i, j) receives operands from the i-th row of matrix A and the j-th column of matrix B and generates the pixel C[i][j]. In the case of CNN acceleration, A is the weight matrix, B is the feature matrix, and C is the output matrix. Suppose there are mr rows and mc columns in the MAC array, then a total of mr × mc pixels can be generated at once. We adopt a simple matrix partitioning strategy to compute the whole output matrix with the 2D MAC array. As shown in Figure 4, the weight matrix is partitioned into m/mr matrix blocks along the row dimension and the feature matrix is partitioned into (h × w)/mc matrix blocks along the column dimension.
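The partitioning can be emulated in software as follows. This is a behavioural sketch, not the HLS design: each iteration of the loop over n stands for one cycle in which all mr × mc MAC units update their partial sums, and edge blocks that do not fill the array are simply masked. The function name and the cycle counter are illustrative assumptions.

```python
def mac_array_matmul(A, B, mr, mc):
    """Multiply A (M x N) by B (N x L) block by block on an mr x mc MAC array."""
    M, N, L = len(A), len(B), len(B[0])
    C = [[0] * L for _ in range(M)]
    cycles = 0
    for i0 in range(0, M, mr):              # blocks of rows of A
        for j0 in range(0, L, mc):          # blocks of columns of B
            rows, cols = min(mr, M - i0), min(mc, L - j0)
            acc = [[0] * mc for _ in range(mr)]   # one register per MAC unit
            for n in range(N):              # one "cycle" per step
                cycles += 1
                for i in range(rows):
                    for j in range(cols):
                        acc[i][j] += A[i0 + i][n] * B[n][j0 + j]
            for i in range(rows):
                for j in range(cols):
                    C[i0 + i][j0 + j] = acc[i][j]
    return C, cycles


A = [[1, 2, 3], [4, 5, 6], [7, 8, 9], [1, 0, 1]]
B = [[1, 0, 2, 0, 1, 1], [0, 1, 0, 2, 1, 0], [1, 1, 1, 1, 0, 2]]
C, cycles = mac_array_matmul(A, B, mr=2, mc=3)
```

With a 2 × 3 array, the 4 × 6 output is covered by 2 × 2 blocks of N = 3 steps each, so the emulation counts 12 cycles.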
The 2D MAC array exploits the parallelism of convolutions in two respects. The output channel loop is unrolled by a factor of mr and hence mr channels of the output feature can be calculated simultaneously. The column loop of each channel is unrolled by a factor of mc and thus mc pixels in the same row of a channel can be computed in parallel. The 2D MAC array is scalable and its size is determined by the hardware resources, the memory bandwidth, the feature size, and the number of input channels. Hardware resources, especially the DSP slices on an FPGA chip, determine the maximum number of MAC units that can be assigned to the 2D MAC array. The width of the 2D MAC array is mainly restricted by the memory bandwidth. We can find an optimal value for the width so that the 2D MAC array is well matched with the memory bandwidth. The 2D MAC array will be under-utilized when the real width is greater than the optimal value, and the memory bandwidth is not fully exploited when the real width is less than the optimal value. The width of the 2D MAC array is also closely related to the feature size of a CNN model. For example, if the feature size is 56 × 56, it is better to deploy 28 columns of MAC units instead of 32, which achieves the same throughput performance with fewer MAC units. The height of the 2D MAC array is closely related to the number of output channels in a CNN model. A common divisor of the output channel numbers of all convolutional layers is preferred. A power of two may also be a good choice for the height of the 2D MAC array.
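The 56 × 56 example can be worked through with a few lines of arithmetic. The sketch below assumes that each output row of width `feature_width` is processed in whole passes of mc columns; the function names are illustrative.

```python
import math


def row_passes(feature_width, mc):
    """Number of passes needed to cover one output row with mc MAC columns."""
    return math.ceil(feature_width / mc)


def column_utilization(feature_width, mc):
    """Fraction of MAC columns doing useful work, averaged over the passes."""
    return feature_width / (row_passes(feature_width, mc) * mc)


passes_28 = row_passes(56, 28)
passes_32 = row_passes(56, 32)
util_28 = column_utilization(56, 28)
util_32 = column_utilization(56, 32)
```

Both widths need two passes per row, so the throughput is the same, but 28 columns are fully used while 32 columns reach only 87.5% utilization.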

Buffer Settings
As shown in Figure 5, we deploy three buffers for caching data on-chip: the weight buffer for kernels in convolutional layers and weights in fully-connected layers, the feature buffer for input features, and the output buffer for output features. Each buffer is composed of multiple independent Block RAMs, and the number of Block RAMs is selected to match the needs of the 2D MAC array. During the matrix multiplication, mr rows of kernel data and mc columns of feature data are required at every cycle. To offer enough data ports to the 2D MAC array, we assign mr Block RAMs to the weight buffer and mc + 2 × pad Block RAMs to the feature buffer. The additional 2 × pad Block RAMs hold the padding data needed when the convolution window slides to the edges. As the 2D MAC array calculates mr × mc pixels of the output feature simultaneously, we assign mc Block RAMs to the output buffer. Once the mr × mc results are generated, they are cached in the output buffer in mr cycles, which is much shorter than the matrix multiplication latency. Meanwhile, we can adopt the burst access pattern when storing the output feature back to the off-chip memory, which raises the memory bandwidth utilization rate.
To save on-chip memory, the weight buffer stores only mr groups of kernels on-chip, the feature buffer stores k + stride rows for each channel of the input feature, and the output buffer stores only mr rows (one row for each channel) of the output feature. To achieve pipelining between memory access and computation, the input buffer pre-caches stride rows for each channel of the input feature during the matrix multiplication and the ping-pong strategy is used on the output buffer. Therefore, the memory access time is overlapped with the computation time to the greatest extent.

Accelerator Architecture
Figure 5 shows an overview of the uniform accelerator architecture for 2D and 3D CNNs. The major components include a controller for fetching instructions and orchestrating all other modules during instruction execution, three buffers for caching input features, weights, and output features respectively, a mapping module to flatten and rearrange input features into feature matrices, and a 2D MAC array to compute matrix multiplications. There are some supplementary function units: the ACC unit for accumulating the results of two convolutional layers, the BN unit for batch-norm and scale layers, the NL unit for nonlinear activation functions, and some FIFOs for the matrix dataflow during matrix multiplications. All buffers are connected through data buses to external DDR3/4 memory. The input features and weights are initialized in the DDR memory by the host CPU, which can be a server CPU or an embedded processor depending on the application scenario.
Given an implemented accelerator, a specific CNN model is computed layer-by-layer by a sequence of macro-instructions. The instructions are fed to the controller directly by the host CPU through the instruction bus. Table 1 lists all the macro-instructions used in our architecture. Basically, each instruction corresponds to a layer type in CNNs. A sum layer is specially introduced due to the splitting strategy. When the input feature of a convolutional layer has too many input channels to be stored in the feature buffer, the convolutional layer is split into multiple convolutional layers with fewer input channels. Sum layers are then used to accumulate the results of these convolutional layers. The batch-norm, scale, and ReLU layers can be combined with the preceding convolutional or fully-connected layer, so there are no independent instructions for them. As shown in Table 2, each macro-instruction is 128 bits long, including the opcode that indicates the layer type and a series of parameters listed below. The meanings of the remaining parameters are the same as above.

• Ix, the height of the input feature;
• Ox, the height of the output feature;
• tm_max, the number of weight matrix blocks;
• tc_max, the number of feature matrix blocks;
• pad, the number of padding rows and padding columns;
• bn_opt, option indicating whether there are batch normalization and scale operations;
• nl_opt, option indicating whether there is a non-linear function.

Accelerator Implementation
As a case study, we implement an accelerator prototype based on the uniform architecture with the HLS methodology. The pseudo-code in Figure 6 (left) demonstrates the working process of a convolutional layer. The weight buffer, feature buffer, and output buffer are declared with specified data types. We adopt fixed-point arithmetic logic units in our implementation. As shown in Figure 6 (right), each kernel is represented by eight bits including one sign bit and seven fraction bits, each pixel in input and output features is represented by 16 bits including one sign bit, seven integer bits, and eight fraction bits, and each intermediate result is represented by 32 bits including one sign bit, 16 integer bits, and 15 fraction bits. The intermediate results are represented by 32 bits to preserve precision during accumulations and are truncated to 16 bits before being written back to memory. The weight buffer is completely partitioned in the row dimension with the array_partition pragma. The feature buffer and output buffer are completely partitioned in the column dimension. The core functions include the load-weight function (line 10), the load-feature function (line 14), the matrix-mapping function (line 17), the matrix-multiply function (line 18), and the store-feature function (line 20). As the function names suggest, the load-weight function loads weights from the off-chip memory to the weight buffer; the load-feature function loads the input feature from the off-chip memory to the input buffer; the store-feature function stores the output feature from the output buffer back to the off-chip memory; the matrix-mapping function corresponds to the matrix mapping module; and the matrix-multiply function corresponds to the 2D MAC array.
[…] multiplications in 3D convolutional layers by 70.4% at the cost of 101% more additions. The additions are executed with LUTs instead of DSP slices. That is why the implementation in [?] has a performance density greater than 2.0 and a much better performance density overall. Table ?? lists all the known accelerator implementations for 3D CNNs. Our accelerator achieves state-of-the-art performance in terms of throughput. As shown in [?], the Winograd algorithm reduces the computation complexity significantly in CNNs. However, there are still some limitations in terms of flexibility. The Winograd algorithm can only be applied when the convolution stride is 1, and it varies with the size of the convolution kernels. By comparison, our framework with the matrix multiplication approach is more generic and can adapt to varied strides and convolution kernels. On the other hand, we have to acknowledge that the Winograd algorithm is a perfect fit for accelerating C3D, as it has a uniform kernel size of 3 × 3 × 3 and a fixed stride of 1 in all convolutional layers. We notice that the implementations in [?] adopt the 3D Winograd algorithm. Our work has shown that 2D CNN accelerators can be used to accelerate 3D CNNs with slight modifications. We will show how the 2D Winograd algorithm performs in accelerating 3D CNNs in our future work.

Conclusion
This paper summarizes our recent work on hardware accelerator design for CNNs. We analytically find that 3D convolutions can be decomposed as the accumulation of multiple 2D convolutions. Therefore, current accelerator designs for 2D CNNs can also be used to accelerate 3D CNNs by appending an accumulation module working along the temporal dimension. For demonstration, we propose a scalable framework for 3D CNNs and implement an accelerator for a real-life 3D CNN model. Evaluation results show that our accelerator achieves state-of-the-art performance compared to other FPGA implementations. Future work includes further demonstrations on CNN models with variable kernel sizes and ASIC implementations for computer vision applications based on the FPGA prototype.


Computation Optimization with HLS Pragmas
The dataflow optimization is adopted to improve the throughput performance. The dataflow pragma enables task-level pipelining, allowing functions and loops to overlap in their operation and increasing the concurrency of the RTL implementation. As shown in Figure 6 (left), the dataflow pragma is specified within the tc-loop, the th-loop, and the tm-loop respectively. Figure 7a illustrates the working process of the tc-loop without dataflow pipelining. The matrix-mapping function and the matrix-multiply function are processed sequentially. With dataflow pipelining, the matrix-multiply function can begin several cycles after the matrix-mapping function begins, and is almost completely overlapped with the matrix-mapping function, as illustrated in Figure 7b. Therefore, the latency of the tc-loop with dataflow pipelining is shortened significantly. Figure 7c shows how the dataflow optimization works on the th-loop. The load-feature function and the store-feature function are fully overlapped by the tc-loop. The th-loop is pipelined and the pipeline interval equals the maximum latency of the three parts. Figure 7d illustrates the tm-loop with dataflow pipelining. Notice that the matrix-multiply function depends on the weight matrix and hence the th-loop has to wait until the load-weight function is done. The latency of the th-loop is typically much longer than the latency of the load-weight function. For example, in the second convolutional layer of the C3D model, the th-loop takes 37,355 cycles while the load-weight function takes only 578 cycles. In this case, applying the ping-pong strategy to the weight buffer would reduce the execution time by at most 1.5%. Therefore, the ping-pong strategy is not adopted on the weight buffer, which saves on-chip memory. That is why the load-weight function is not fully overlapped with the th-loop. The HLS pragmas unroll and pipeline are used inside these functions to reduce latency and enhance throughput performance. Figure 6 (right) shows the HLS pseudocode of the matrix-multiply function. The unroll pragma enables some or all loop iterations to occur in parallel by creating multiple copies of the loop body in the RTL design. In the matrix-multiply function, mr × mc multipliers and adders are created, forming the 2D MAC array, and hence mr × mc multiply-accumulations are computed concurrently. The pipeline pragma helps to reduce the initiation interval of a loop by allowing the concurrent execution of operations. The initiation interval in the matrix-multiply function is one cycle after optimization. To summarize, the total execution latency is greatly reduced and the system throughput is enhanced significantly owing to the HLS pragmas.
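The effect of overlapping stages, and the 1.5% bound quoted above, can be reproduced with a simple latency model. The model is an idealized assumption (stages with fixed latencies, no stalls, one pipeline fill) and is not derived from the HLS reports; the function names are illustrative.

```python
def sequential_latency(stage_cycles, iterations):
    """Stages run one after another in every iteration."""
    return iterations * sum(stage_cycles)


def dataflow_latency(stage_cycles, iterations):
    """Idealized pipelining: fill once, then one iteration per slowest stage."""
    return sum(stage_cycles) + (iterations - 1) * max(stage_cycles)


def ping_pong_saving(load_weight_cycles, th_loop_cycles):
    """Fraction of time saved if weight loading were fully hidden."""
    without = load_weight_cycles + th_loop_cycles
    with_ping_pong = max(load_weight_cycles, th_loop_cycles)
    return (without - with_ping_pong) / without


# Cycle counts quoted in the text for the second convolutional layer of C3D.
saving = ping_pong_saving(578, 37355)
saving_percent = round(100 * saving, 1)

# Hypothetical three-stage loop to compare the two execution styles.
seq = sequential_latency([10, 20, 5], iterations=4)
pipe = dataflow_latency([10, 20, 5], iterations=4)
```

With 578 and 37,355 cycles, hiding the weight loading completely would save 578 of 37,933 cycles, which rounds to 1.5% and is consistent with the decision not to double the weight buffer.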

Memory Optimization
Memory optimizations are made to better use the external memory bandwidth. We expand the data access width using the HLS data_pack pragma, which packs the data fields of a struct into a single scalar with a larger byte length. The data_pack pragma helps to reduce the memory access time, and all members of the struct can be read and written simultaneously. In our implementation, mr weights are packed into a struct for the load-weight function, mc + 2 × pad pixels are packed for the load-feature function, and mc pixels are packed for the store-feature function.
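The effect of packing can be illustrated in software by combining several 8-bit weights into one wide word that is transferred in a single access. This sketch only mimics the idea; it is not the data_pack pragma itself, and the lane order (element 0 in the least significant byte) is an assumption made for illustration.

```python
def pack_int8(values):
    """Pack signed 8-bit values into one integer, element 0 in the low byte."""
    word = 0
    for i, v in enumerate(values):
        if not -128 <= v <= 127:
            raise ValueError("value does not fit in 8 bits")
        word |= (v & 0xFF) << (8 * i)
    return word


def unpack_int8(word, count):
    """Recover `count` signed 8-bit values from a packed word."""
    out = []
    for i in range(count):
        byte = (word >> (8 * i)) & 0xFF
        out.append(byte - 256 if byte >= 128 else byte)
    return out


weights = [-128, -1, 0, 127]          # e.g. mr = 4 weights fetched together
word = pack_int8(weights)
restored = unpack_int8(word, len(weights))
```

One wide transfer of `word` replaces four narrow transfers, which is the reduction in memory access time described above.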
In addition, we optimize the storage pattern of the features in the external memory space. Since the load-feature function loads one entire row of all input channels every time, the store-feature function stores the features back to the off-chip memory in frame-height-channel-width order instead of frame-channel-height-width order. Therefore, the load-feature function and the store-feature function can transfer data between the on-chip buffer and the off-chip memory in burst access mode, which raises the bandwidth utilization rate significantly. The original map is still stored in frame-channel-height-width order as we do not want to introduce additional work. The large width of the original map guarantees the burst access length in the load-feature function of the first convolutional layer. With the above memory optimizations, the memory bandwidth is no longer the bottleneck. In our implementation, the load-feature and store-feature functions are fully overlapped by the matrix-multiply function. That is to say, the 2D MAC array is only idle while the pipeline is being set up and flushed, and is fully utilized in convolutional layers once the pipeline is ready.
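The benefit of the reordered layout can be seen by computing the addresses touched when one row of all channels is loaded. The sketch below assumes a flat, element-addressed memory and dense row-major storage in each order; the function names are illustrative.

```python
def addr_fchw(f, cc, hh, ww, C, H, W):
    """Element address in frame-channel-height-width order."""
    return ((f * C + cc) * H + hh) * W + ww


def addr_fhcw(f, cc, hh, ww, C, H, W):
    """Element address in frame-height-channel-width order."""
    return ((f * H + hh) * C + cc) * W + ww


def bursts_for_row(addr, f, hh, C, H, W):
    """Lengths of the contiguous runs needed to load row hh of all channels."""
    addrs = sorted(addr(f, cc, hh, ww, C, H, W)
                   for cc in range(C) for ww in range(W))
    bursts, run = [], 1
    for a, b in zip(addrs, addrs[1:]):
        if b == a + 1:
            run += 1
        else:
            bursts.append(run)
            run = 1
    bursts.append(run)
    return bursts


# Arbitrary small feature: 4 channels, 3 rows, 5 columns; load row 1 of frame 0.
bursts_fhcw = bursts_for_row(addr_fhcw, 0, 1, C=4, H=3, W=5)
bursts_fchw = bursts_for_row(addr_fchw, 0, 1, C=4, H=3, W=5)
```

In frame-height-channel-width order the row of all channels is a single burst of C × W = 20 elements, whereas the frame-channel-height-width order needs four separate bursts of 5 elements.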

Resource Modeling
An FPGA has several different types of resources of which DSP slices and on-chip memory (Block RAM) have the most effect on the configuration of a CNN accelerator.Since we are using fixed-point arithmetic in our architecture, the only component that consumes DSP slices is the multiplier in the MAC units.Each multiplier consumes one DSP slice.Hence, the total consumption of DSP slices is given by mr × mc, which should be less than the total number of available DSP slices.
In terms of on-chip memory utilization, we analytically model the consumption of the weight buffer, input buffer, and output buffer. There are mr Block RAMs in the weight buffer. Each weight is represented as one byte, and hence the width of each Block RAM is 8 bits. Assuming the depth of each Block RAM is kdepth, the total on-chip memory consumed by the weight buffer is mr × kdepth bytes. The depth of each Block RAM is given by Equation (6) if the on-chip memory is abundant. The layer_num in the equation indicates the total number of layers in a CNN model.
The feature buffer, which serves as the input buffer, stores input features in mc + 2 × pad Block RAMs. The additional 2 × pad Block RAMs are introduced due to the padding required at the edges. Each pixel in the input features is represented as two bytes, and hence the width of each Block RAM is 16 bits. Assuming the depth of each Block RAM is idepth, the total on-chip memory consumed by the input buffer is (mc + 2 × pad) × idepth × 2 bytes. The depth of each Block RAM is given by Equation (7) if the on-chip memory is abundant.
When the on-chip memory is limited, kdepth and idepth are specified by users under the memory constraint. For some convolutional layers with a large number of input channels, kdepth may be less than the width of the weight matrix, or idepth may be less than the height of the feature matrix. The splitting strategy then splits the convolutional layer into multiple convolutional layers with fewer input channels so that each fits the weight buffer and the feature buffer.
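A minimal sketch of how such a split might be computed is given below. It assumes that only the weight-buffer constraint is considered (a group of c_i input channels needs c_i × k × k entries of buffer depth) and that the channels are divided as evenly as possible; the function name and the even-split policy are illustrative assumptions rather than the implemented strategy.

```python
import math


def split_input_channels(c, k, kdepth):
    """Split c input channels into groups whose weight rows fit in kdepth.

    Assumption: one weight-matrix row for a group of c_i channels occupies
    c_i * k * k buffer entries, so each group must satisfy c_i * k * k <= kdepth.
    """
    max_c = kdepth // (k * k)      # largest channel group that still fits
    if max_c == 0:
        raise ValueError("kdepth is too small for even one input channel")
    n = math.ceil(c / max_c)       # fewest sub-layers that satisfy the limit
    base, extra = divmod(c, n)
    return [base + 1 if i < extra else base for i in range(n)]


# kdepth = 5120 is the weight-buffer depth used in the evaluation.
print(split_input_channels(512, 3, 5120))    # fits without splitting
print(split_input_channels(1536, 3, 5120))   # hypothetical wider layer
```

Each sub-layer produces partial sums over its own channel group, so the results of the sub-layers have to be accumulated to obtain the output of the original layer.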
The output buffer stores output features in mc Block RAMs. The ping-pong strategy is adopted for the output buffer. Each pixel in the output features is represented as two bytes, and hence the width of each Block RAM is 16 bits. Assuming the depth of each Block RAM is odepth, the total on-chip memory consumed by the output buffer is mc × odepth × 4 bytes, where the factor of 4 combines the two bytes per pixel with the two copies required by the ping-pong strategy. The depth of each Block RAM is given by Equation (8) if the on-chip memory is abundant.
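The resource model above can be evaluated numerically. The following sketch is an illustration, not part of the implementation: the function name is hypothetical, the parameter values are those reported in the experimental setup (a 64 × 56 MAC array, buffer depths of 5120, 2048, and 512, and 3600 available DSP slices), and pad = 2 is inferred from the 60 Block RAMs of the feature buffer.

```python
def resource_model(mr, mc, pad, kdepth, idepth, odepth, dsp_available):
    """Evaluate the DSP and buffer-size formulas of the resource model."""
    dsp = mr * mc                               # one DSP slice per multiplier
    weight_bytes = mr * kdepth                  # 1 byte per weight
    feature_bytes = (mc + 2 * pad) * idepth * 2 # 2 bytes per input pixel
    output_bytes = mc * odepth * 4              # 2 bytes per pixel, ping-pong
    return {
        "dsp": dsp,
        "dsp_fits": dsp <= dsp_available,
        "weight_bytes": weight_bytes,
        "feature_bytes": feature_bytes,
        "output_bytes": output_bytes,
        "total_bytes": weight_bytes + feature_bytes + output_bytes,
    }


usage = resource_model(mr=64, mc=56, pad=2,
                       kdepth=5120, idepth=2048, odepth=512,
                       dsp_available=3600)
for name, value in usage.items():
    print(f"{name}: {value}")
```

The byte counts are the modeled buffer capacities only; they do not predict how many physical Block RAM primitives the synthesis tool allocates.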
When the on-chip memory is limited or the feature width is too large, odepth can be specified by users under the memory constraint. The 2D MAC array may then be under-utilized in some convolutional layers with a large feature width. In that case, the real unrolling factor of the output-channel loop is less than mr and is given by the following equation:

Performance Modeling
The main operations in matrix multiplication are multiplications and additions, conducted by the 2D MAC array. As there are mr × mc MAC units, mr × mc × 2 operations are processed every clock cycle in the ideal case. Therefore, the peak throughput is given by mr × mc × 2 × f OP/s, where f is the clock frequency.
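As a quick check of this formula, the sketch below evaluates it for the configuration used in the evaluation (a 64 × 56 MAC array at 120 MHz) and relates the per-layer throughputs reported there to the peak. The function name is an illustrative choice.

```python
def peak_throughput_gops(mr, mc, f_hz):
    """Peak throughput in GOP/s: each MAC unit performs 2 operations per cycle."""
    return mr * mc * 2 * f_hz / 1e9


peak = peak_throughput_gops(mr=64, mc=56, f_hz=120e6)
print(f"peak throughput: {peak:.2f} GOP/s")

# Best per-layer throughputs reported in the evaluation, in GOP/s.
reported = {"AlexNet": 811.5, "VGG16": 856.1, "C3D": 851.2}
for model, gops in reported.items():
    print(f"{model}: {gops / peak:.1%} of peak")
```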
The actual throughput is the total number of operations divided by the total execution time. The total number of operations in a convolutional layer can be calculated by Equation (10). In 2D convolutional layers, l = 1 and d = 1.
Owing to the dataflow optimization, the functions work in a pipelined way and some functions are even completely overlapped by others, as illustrated in Figure 7. The total execution cycles for a convolutional layer can be calculated by Equation (11), where ccl_ldw, ccl_ldf, and ccl_stf indicate the execution cycles of the load-weight, load-feature, and store-feature functions, respectively. II_th indicates the pipeline interval of the th-loop in Figure 6.
The matrix-mapping function takes c × k × k cycles to generate a feature matrix block, and the matrix-multiply function also takes c × k × k cycles to complete a matrix multiplication. Hence, the total number of cycles taken by the tc-loop in Figure 6 is given by:
The execution time of the other functions is closely related to the real memory access bandwidth. We obtain simplified models for the memory-related functions under sufficient memory access bandwidth: the load-weight function takes c × k × k cycles, the load-feature function takes c × stride × w/mc cycles, and the store-feature function takes mr × w/mc cycles. The pipeline interval of the th-loop is the maximum of ccl_ldf, ccl_stf, and ccl_tc.
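The simplified cycle models can be written down directly. The sketch below is an illustration with hypothetical function names; it assumes that w/mc is rounded up to a whole number of column blocks, and it takes the tc-loop cycle count ccl_tc as an input rather than deriving it.

```python
import math


def memory_cycles(c, k, stride, w, mr, mc):
    """Simplified cycle counts under sufficient memory bandwidth."""
    col_blocks = math.ceil(w / mc)   # assumption: w/mc is rounded up
    ccl_ldw = c * k * k              # load-weight
    ccl_ldf = c * stride * col_blocks  # load-feature
    ccl_stf = mr * col_blocks        # store-feature
    return ccl_ldw, ccl_ldf, ccl_stf


def pipeline_interval(ccl_ldf, ccl_stf, ccl_tc):
    """Pipeline interval of the th-loop: the slowest overlapped stage."""
    return max(ccl_ldf, ccl_stf, ccl_tc)


# Illustrative layer: 64 input channels, 3 x 3 kernel, stride 1, width 56.
ccl_ldw, ccl_ldf, ccl_stf = memory_cycles(c=64, k=3, stride=1, w=56, mr=64, mc=56)
# ccl_tc is assumed here to equal one matrix multiplication (c * k * k cycles).
ii_th = pipeline_interval(ccl_ldf, ccl_stf, ccl_tc=64 * 3 * 3)
print(ccl_ldw, ccl_ldf, ccl_stf, ii_th)
```

In this example the matrix multiplication dominates the interval, which is consistent with the load-feature and store-feature functions being fully overlapped by the matrix-multiply function.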

Evaluation
In this section, we first introduce the experimental setup and then present detailed experimental results.

Experimental Setup
We evaluate our design by implementing an accelerator on a Xilinx VC709 board that contains a Xilinx XC7VX690T FPGA chip with 3600 DSP slices. We select three typical CNN models for testing: AlexNet, VGG16, and C3D. AlexNet and VGG16 are 2D CNN models for image classification, and C3D is a 3D CNN model for human action recognition. The channel numbers in the convolutional layers include 64, 128, 256, and 512. The feature sizes include 224 × 224, 112 × 112, 56 × 56, 28 × 28, and 14 × 14. Considering the total number of available DSP slices, we assign 64 rows and 56 columns to the 2D MAC array. Accordingly, the weight buffer has 64 Block RAMs with a depth of 5120, the feature buffer has 60 Block RAMs with a depth of 2048, and the output buffer has 56 Block RAMs with a depth of 512. The accelerator design is implemented with Vivado HLS and synthesized with Vivado Design Suite 2018.1.
The software versions are implemented with Caffe [8], a deep-learning framework, running on an Intel Core i5-4440 CPU (@3.10 GHz) and an NVIDIA GPU (Titan X Pascal). The Caffe framework used in the software implementations is the original one developed by the Berkeley Vision and Learning Center (BVLC), with no special optimizations for the Intel architecture. There are no fixed-point arithmetic units in the CPU or the GPU, and hence the software versions are implemented with floating-point arithmetic. The dataset for testing AlexNet and VGG16 is ImageNet [20], and the dataset for testing C3D is UCF101 [21]. We use a batch size of 16 for both versions during testing.

Experimental Results
Table 3 presents the hardware resource utilization of our accelerator. The accelerator consumes 3595 DSP slices, nearly 100% of the available DSP slices. The 2D MAC array consumes 3584 of the 3595 DSP slices for computing matrix multiplications, and the other 11 DSP slices are used for calculating addresses during memory access. In addition, the accelerator uses 391 BRAMs, less than 27% of the available BRAMs. The BRAMs are mainly consumed by the weight buffer, feature buffer, and output buffer. The utilization results show that our architecture can fully utilize the computation resources while consuming very few memory resources. In particular, the splitting strategy makes the architecture even less dependent on the on-chip memory capacity. Therefore, our architecture can be deployed to platforms with limited on-chip memory capacity and is well suited to ASIC implementations.
Figure 8 presents the evaluation results of each layer in the three CNN models. According to the performance model, the peak throughput of our accelerator at 120 MHz is 860.2 GOP/s. The conv2a layer achieves the highest throughput of 811.5 GOP/s in AlexNet, the conv1b layer achieves the highest throughput of 856.1 GOP/s in VGG16, and the conv2a layer achieves the highest throughput of 851.2 GOP/s in C3D. The first convolutional layer in all three CNN models has only three input channels, which leaves the 2D MAC array under-utilized. Since C3D has a temporal depth of three, the real number of input channels in its first convolutional layer is nine. That is why the conv1a layer achieves poor throughput in AlexNet and VGG16 but much higher throughput in C3D. We can also see from Figure 8 that the throughput of each layer decreases with the layer depth. The reason is that the feature size decreases owing to the pooling layers, and hence the th-loop iterates fewer times in later convolutional layers. As shown in Equation (11), ccl_ldw and ccl_ldf account for a larger share of the total execution cycles when h decreases. In fully-connected layers, the 2D MAC array is under-utilized because it is restricted by the memory bandwidth. Thus, the throughput of fully-connected layers is around 40 GOP/s, far lower than that of convolutional layers. That is why the average throughput of the convolutional layers is more than 407.2 GOP/s in AlexNet while the overall throughput is only 231.6 GOP/s. By comparison, VGG16 and C3D have more convolutional layers, and hence the fully-connected layers have less effect on the overall throughput. The accelerator achieves an overall throughput of 691.6 GOP/s on VGG16 and 667.7 GOP/s on C3D.
[...] density among all the listed implementations. In terms of throughput, our accelerator achieves state-of-the-art performance on both VGG and C3D. The implementation in [19] achieves the best throughput performance with a much higher clock frequency than the other implementations.
As shown in [17], the Winograd algorithm significantly reduces the computational complexity of CNNs. However, it still has some limitations in terms of flexibility. The Winograd algorithm can only be applied when the convolution stride is 1, and the transform matrices vary with the size of the convolution kernels. By comparison, our architecture with the matrix multiplication approach is more generic and can adapt to varied strides and convolution kernels. On the other hand, we have to acknowledge that the Winograd algorithm is a perfect fit for accelerating VGG and C3D, as they have a uniform kernel size of 3 × 3 and a fixed stride of 1 in all convolutional layers. However, it is not suitable for accelerating AlexNet, which has multiple kernel sizes (11 × 11, 5 × 5, and 3 × 3) and multiple strides (4 and 1). This is not a problem for our architecture, as we have demonstrated in the evaluation. The architecture in [17] adopts a template-based approach that supports accelerating both 2D and 3D CNNs. Different configurations (e.g., unrolling factors, Winograd algorithms) have to be customized for different CNN models. That is why they implement two accelerators, for accelerating VGG16 and C3D respectively. By comparison, our accelerator can accelerate different CNN models with the same configuration. AlexNet, VGG16, and C3D are run on the same accelerator in our evaluation and achieve very close throughput performance on convolutional layers.

Figure 1. 2D and 3D convolution operations. Applying 2D convolution on an image (a) or multiple channels of image (b) results in an image. Applying 3D convolution on a video volume (c) or multiple channels of video volume (d) results in another volume.

Figure 3. Generating feature matrix blocks in the matrix mapping module.

Figure 4. Partitioning weight and feature matrices to blocks to adapt to the 2D multiply and accumulate (MAC) array.

Figure 6. High-level synthesis (HLS) pseudocode demonstrating the working process of a convolutional layer (left) and how to compute the matrix multiplication with the 2D MAC array (right).

Figure 7. How dataflow optimization works on the functions and loops.
Version December 17, 2018 submitted to Journal Not Specified 12 of 18
[...] [19] reduces the multiplications in 3D convolutional layers by 70.4% at the cost of 101% more additions. The additions are executed with LUTs instead of DSP slices. That is why the implementation in [19] has a performance density greater than 2.0 and a much better performance density overall. Table 4 lists all the known accelerator implementations for 3D CNNs. Our accelerator achieves state-of-the-art performance in terms of throughput.
adopt the 3D Winograd algorithm.Our work has shown that 2D CNN accelerators can be used to accelerate 3D CNNs with slight modifications.We will show how the 2D Winograd algorithm performs to accelerate 3D CNNs in our future work.