Design of Flexible Hardware Accelerators for Image Convolutions and Transposed Convolutions

Nowadays, computer vision relies heavily on convolutional neural networks (CNNs) to perform complex and accurate tasks. Among them, super-resolution CNNs represent a meaningful example, due to the presence of both convolutional (CONV) and transposed convolutional (TCONV) layers. While the former exploit multiply-and-accumulate (MAC) operations to extract features of interest from incoming feature maps (fmaps), the latter perform MACs to tune the spatial resolution of the received fmaps properly. The ever-growing real-time and low-power requirements of modern computer vision applications represent a stimulus for the research community to investigate the deployment of CNNs on well-suited hardware platforms, such as field programmable gate arrays (FPGAs). FPGAs are widely recognized as valid candidates for trading off computational speed and power consumption, thanks to their flexibility and their capability to also deal with computationally intensive models. In order to reduce the number of operations to be performed, this paper presents a novel hardware-oriented algorithm able to efficiently accelerate both CONVs and TCONVs. The proposed strategy was validated by employing it within a reconfigurable hardware accelerator purposely designed to adapt itself to different operating modes set at run-time. When characterized using the Xilinx XC7K410T FPGA device, the proposed accelerator achieved a throughput of up to 2022.2 GOPS and, in comparison to state-of-the-art competitors, it reached an energy efficiency up to 2.3 times higher, without compromising the overall accuracy.


Introduction
In the last few years, deep learning algorithms, particularly convolutional neural networks (CNNs), have attracted considerable interest in several computer vision tasks, ranging from object detection [1] to image classification [2] and segmentation [3]. In such applications, the ever-growing success of CNNs is accompanied by a continuous increase in both accuracy and computational complexity. As an example, in the case of image classification, moving from the eight-layered AlexNet [4] to the 152-layered ResNet [5], the error rates have been reduced by more than 10%, but the amount of performed multiply-and-accumulate (MAC) operations has increased by more than 80%. Such a trend makes it evident that ad-hoc designed hardware accelerators are essential for deploying CNN algorithms in real-time and power-constrained systems [6].
Most recently, the capability of reconstructing high-resolution images from low-resolution ones by means of pixel estimation, which is known as super resolution (SR) imaging, has become crucial in several applications, such as video surveillance, medical diagnosis, and remote sensing. Also in this field, CNNs have gained enormous popularity [7] and, thanks to the ability of learned filters to extrapolate new features from low-resolution images, they have demonstrated appreciable quality improvements with respect to conventional methods [8,9]. Unfortunately, because of the different nature of the final task

• A comprehensive evaluation of the state-of-the-art TCONV algorithms suitable for implementation in hardware is provided.
• An original TCONV approach, conceived to avoid complex remapping of filter coefficients and suitable for exploitation also in CONV operations, is presented.
• A flexible reconfigurable hardware accelerator is proposed. It was purposely designed to adapt itself at run-time to two operating modes and to different kernel sizes, as required to support all operations employed in both CONV and TCONV layers.
• For evaluation purposes, the novel method was exploited in the context of SR imaging, and the proposed reconfigurable hardware architecture was used to accelerate the popular fast super resolution CNN (FSRCNN) [10]. The experiments, performed on the Xilinx XC7K410T field programmable gate array (FPGA) chip, demonstrated the benefits of the proposed approach in terms of area occupancy and energy saving over several state-of-the-art counterparts. In fact, the new accelerator exhibited a logic resource requirement and a power consumption up to ~63% and ~48% lower, respectively, than previous designs [11,13–17]. The adopted parallelism and the achieved 227 MHz running frequency allow the above advantages to be obtained without compromising the competitiveness of the proposed design in terms of speed performance.
The remainder of this paper is structured as follows: Section 2 provides background and a survey of previous works; the novel algorithm and the purpose-designed hardware architecture are presented in Sections 3 and 4; the experimental results are discussed in Section 5, which also includes a comparison to state-of-the-art accelerators implemented on FPGAs in terms of hardware characteristics and quality metrics. Finally, Section 6 concludes this manuscript.

Background and Related Works
The CNNs employed in SR imaging tasks [7] often include a feature extractor, consisting of several cascaded CONV layers, followed by an up-sampling stage consisting of a certain number of cascaded TCONV layers. The generic layer receives a volume of M input feature maps (ifmaps), each of size H_i × W_i, and a set of N filters {F_0, F_1, …, F_{N−1}}, each consisting of M kernels of size k × k. The specific operations performed by the layer produce a volume of N output feature maps (ofmaps), each of size H_o × W_o, with H_o and W_o being defined as H_o = (S_D × H_i + 2P) − k + 1 and W_o = (S_D × W_i + 2P) − k + 1, S_D and P being, respectively, the up-sampling factor and the size of the padding on the borders.
In the case of CONVs, S_D = 1 and, to generate the h-th ofmap, the volume of ifmaps is convolved with the corresponding filter F_h. Then, the M results obtained in this way are summed up by a pixel-wise addition. Conversely, a TCONV layer refers to S_D > 1 and requires the generic ifmap to be preliminarily up-sampled by interleaving the actual input activations with S_D − 1 additional rows and columns. After this, the operations involved are the same as those of a conventional CONV layer. The example illustrated in Figure 1 shows the operations performed to process a 2 × 2 ifmap with a 3 × 3 filter when S_D = 2. It is worth noting that the additional elements introduced in the up-sampled ifmap can be filled either with zeros [18] (in the following, this approach is named the zero-TCONV) or by interpolating the nearest neighboring (NN) values to reduce possible chessboard effects [19]. Regardless, knowing the size H_i × W_i of the original ifmap, the up-sampling factor S_D, and the size P of the padding on the borders, the size H_o × W_o of the up-sampled ifmap is given by Equation (1).
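As a concrete reference for the zero-TCONV definition above, the following Python sketch (illustrative only; the function names are ours) up-samples an ifmap by interleaving zeros and evaluates the size formula H_o = (S_D × H_i + 2P) − k + 1:

```python
def zero_upsample(ifmap, s_d):
    """Interleave S_D - 1 zero rows/columns after each activation (zero-TCONV)."""
    h_i, w_i = len(ifmap), len(ifmap[0])
    up = [[0] * (s_d * w_i) for _ in range(s_d * h_i)]
    for r in range(h_i):
        for c in range(w_i):
            up[r * s_d][c * s_d] = ifmap[r][c]
    return up

def output_size(h_i, w_i, k, s_d, p):
    """H_o = (S_D * H_i + 2P) - k + 1, and likewise for W_o."""
    return (s_d * h_i + 2 * p) - k + 1, (s_d * w_i + 2 * p) - k + 1
```

For instance, a 2 × 2 ifmap processed with a 3 × 3 filter at S_D = 2 and P = 1 yields a 4 × 4 ofmap according to the formula.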
Since they process up-sampled ifmaps, it is obvious that, with respect to CONVs, TCONVs require more MAC operations and larger amounts of data memory. Unfortunately, these characteristics may represent a bottleneck in those application scenarios in which real time and low power are mandatory. For this reason, the design of ad-hoc hardware accelerators suitable for exploitation also within time- and power-constrained operating environments has recently received a great deal of attention [11–17,19–23]. Among the possible hardware realization platforms, FPGAs are widely recognized as powerful solutions [11,13,15,17,20] for merging the benefits of custom hardware designs, such as computational parallelism and limited energy consumption, with the strengths of software designs, including reconfigurability and short time to market.
While several of the existing hardware designs support both CONVs and TCONVs [11,13–17,19,21], some of them are tailored to accomplish only TCONVs [12,22,23]. As an example, the FPGA accelerator proposed in our previous work [12] adopts the input-oriented method (IOM) to reduce, or completely avoid, the useless operations, corresponding to multiplications by zero, introduced by the conventional zero-TCONV up-sampling approach. This is made possible by computing the products between each input pixel and the k × k elements of the filter, and then properly arranging the k × k results within the ofmap. Obviously, as a drawback, the designs in [12,22,23] need either additional buffers or auxiliary computing resources, or both, to manage row/column overlaps. Moreover, they may be quite inefficient when the CNN model being accelerated also uses CONV layers, as happens in SR imaging applications [10,11].
The designs recently presented in [11,13–15] overcome the aforementioned issues by exploiting uniform accelerators for both CONVs and TCONVs. Starting from an analysis of the IOM, and with the objective of avoiding overlaps on the input activations, the computational scheme proposed in [11] performs an inverse mapping on the filter coefficients. More specifically, the transform deconvolution into convolution (TDC) approach [11] converts each filter of a TCONV into S_D² smaller sub-filters according to the relative position of the original input activations within the up-sampled ifmap. Due to this splitting strategy, several locations within the sub-filters contain zero values, thus causing unbalanced computations. Moreover, the configuration (i.e., the size and number of sub-filters) depends on S_D. Therefore, the splitting process has to be performed offline and the preprocessed filters must be stored on chip, thus limiting the possibility of reconfiguring the architecture at run-time to accelerate different CNNs. As observed in [13], when the zero-TCONV approach is used, the filter coefficients that are multiplied by zero activations can be removed by decomposing the filters into several sub-blocks. Also for this decomposition algorithm, the filters must be pre-processed offline. Moreover, in order to remove unbalanced computations, overall logic more complex than that of [11] is required.
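One way to picture the filter splitting described above (our illustrative reading, not the exact procedure of [11]): each of the S_D² sub-filters gathers the coefficients that one output phase actually uses, obtained by slicing the original k × k filter with stride S_D from different row/column offsets. The resulting sub-filters have different sizes, so padding them to a common size introduces the zero locations responsible for the unbalanced computations.

```python
def tdc_subfilters(filt, s_d):
    # Hypothetical sketch: sub-filter (p, q) takes every s_d-th row and
    # column of the k x k filter, starting at offsets (p, q).  The sizes
    # differ across the s_d * s_d phases, hence the padding imbalance.
    subs = {}
    for p in range(s_d):
        for q in range(s_d):
            subs[(p, q)] = [row[q::s_d] for row in filt[p::s_d]]
    return subs
```

For k = 9 and S_D = 2, this slicing yields sub-filters of sizes 5 × 5, 5 × 4, 4 × 5, and 4 × 4.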
To manage both TCONV and CONV operations, the hardware designs proposed in [14,15] decompose the filters into smaller sub-blocks of different dimensions, according to the values of k and S_D; then, to avoid filter reversal and zero padding on the borders, they apply a variant of the conventional Winograd algorithm. In such a case, unconventional computational modules, suitable for implementing the operations involved in the Winograd transformation (such as the inverse transformation of a matrix), are required.
The FlexiGAN architecture presented in [21] adopts the conventional zero-TCONV operations but, in order to improve the computational efficiency, it recognizes rows filled with zeros and skips them during the MAC operations. However, the auxiliary circuitry needed to properly reorganize the ifmaps and the filters significantly affects the logic and memory resource requirements, as well as the power consumption.

The Hardware-Oriented Algorithm Proposed to Convert TCONVs into CONVs
The novel algorithm presented here exploits a computational strategy quite different from previous works [11–15]. In contrast to [11,13–15], which manipulate the k × k filter coefficients to form smaller sub-blocks (thus introducing the need for offline elaborations), and to [12], which re-arranges the positions of the output values within the ofmaps (leading to area and time overheads due to the management of the overlapping regions), it applies an unconventional remapping strategy directly to the incoming ifmap values. From a hardware perspective, this means that: (1) the process occurs online and no preprocessing is required; and (2) the results of the proposed algorithm can be outputted as soon as they are produced, thus avoiding additional time and buffering/computing resources. As a further advantage, the incoming ifmaps are not actually up-sampled, but are instead processed as if they were up-sampled with the zero-TCONV approach.
In order to achieve high-speed performance and to prevent useless multiplications by zero, the proposed method was purposely made able to furnish S_D × S_D results in parallel for each computed ofmap. The steps illustrated in Figure 2a are performed to process the K_C × K_C window of activations. The generic sliding window received as input, with the first (i.e., the top-left) activation of the window being I_{i,j} (with i = 0, …, H_i − 1 and j = 0, …, W_i − 1), is remapped within a k × k window; then, element-wise multiplications are performed between the remapped window and the k × k filter, followed by accumulations to produce S_D × S_D parallel results. The main innovation introduced with respect to the conventional approach and to the methods based on filter decomposition [11,13–15] is the remapping of the K_C × K_C input activations within the sliding window RI. The latter is formed as illustrated in Figure 2b, which also shows the local row and column indices m and n, both varying from 0 to k − 1. The remapped window is obtained by applying the following basic rules:

• The first activation I_{i,j} is assigned to the local position (0,0) within the up-sampled window RI and is not replicated further;
• The activations with a row index equal to i are replicated S_D times horizontally;
• The activations with a column index equal to j are replicated S_D times vertically;
• The activations with row and column indices varying, respectively, from i + 1 to i + K_C − 2 and from j + 1 to j + K_C − 2 are replicated S_D times vertically and S_D times horizontally, thus forming S_D × S_D sub-windows, as illustrated in Figure 2b;
• If (k − 1) mod S_D = 0, the activations with a row index equal to K_C − 1 are replicated S_D times horizontally (this is the case illustrated in Figure 2b); otherwise, they are replicated (k − 1) mod S_D times;
• If (k − 1) mod S_D = 0, the activations with a column index equal to K_C − 1 are replicated S_D times vertically (this is the case illustrated in Figure 2b); otherwise, they are replicated (k − 1) mod S_D times.
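The rules above can be captured compactly: the element placed at local position (m, n) of RI is the window activation at row ⌈m/S_D⌉ and column ⌈n/S_D⌉. A minimal Python sketch of this reading (our formulation, not the paper's hardware logic):

```python
import math

def remap_window(window, k, s_d):
    """Build the k x k remapped window RI from a K_C x K_C sliding window."""
    return [[window[math.ceil(m / s_d)][math.ceil(n / s_d)]
             for n in range(k)] for m in range(k)]
```

For k = 3 and S_D = 2 (so K_C = 2), the window [[a, b], [c, d]] is remapped to [[a, b, b], [c, d, d], [c, d, d]]: a appears once, the first row and column are replicated S_D times, and the interior element forms an S_D × S_D sub-window.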
J. Imaging 2021, 7, 210


The elements of the remapped window, obtained as explained above, are multiplied by the homologous filter coefficients W_{m,n}, which do not require any type of rearrangement. Then, the computed k × k products PP_{m,n} are properly accumulated to finally provide the S_D × S_D parallel results O_{i×S_D+p, j×S_D+q}, with p and q varying from 0 to S_D − 1. To take into account the up-sampling factor S_D, the generic result O_{i×S_D+p, j×S_D+q} must be computed by accumulating K_C × K_C products PP_{mm,nn}, picked up starting from the location mm = i × S_D, nn = j × S_D and going on as in a chessboard with horizontal and vertical jumps of S_D positions (i.e., with stride S_D). However, it is worth noting that some jumps lead to values of mm and/or nn exceeding k, thus indexing unavailable products. Actually, referring to the ifmap currently processed as if it were up-sampled with the zero-TCONV approach, it is easy to verify that these missing products correspond to multiplications by zero. Therefore, they do not contribute to the accumulations and can simply be ignored. As a consequence, the results computed with the proposed strategy have the same values as those provided by the conventional zero-TCONV approach [18]. However, the method proposed here completely avoids multiplications by zero and filter partitioning. The software model of the proposed method is reported in Appendix A.
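Putting the remapping and the strided accumulation together, a behavioral Python model of one window's worth of work might look as follows. This is a sketch under our reading of the scheme: we take the accumulation start offsets for output (p, q) to be ((S_D − p) mod S_D, (S_D − q) mod S_D), which reproduces the zero-TCONV results.

```python
import math

def tconv_group(window, filt, k, s_d):
    """Compute the S_D x S_D parallel outputs for one K_C x K_C sliding window."""
    # Step 1: remap (our compact reading: RI[m][n] = window[ceil(m/S_D)][ceil(n/S_D)])
    ri = [[window[math.ceil(m / s_d)][math.ceil(n / s_d)]
           for n in range(k)] for m in range(k)]
    # Step 2: element-wise products PP[m][n]
    pp = [[ri[m][n] * filt[m][n] for n in range(k)] for m in range(k)]
    # Step 3: accumulate with stride S_D; products that would index outside
    # the k x k array are the multiplications by zero of the zero-TCONV view
    # and are simply skipped
    out = [[0] * s_d for _ in range(s_d)]
    for p in range(s_d):
        for q in range(s_d):
            m0, n0 = (s_d - p) % s_d, (s_d - q) % s_d
            out[p][q] = sum(pp[m][n]
                            for m in range(m0, k, s_d)
                            for n in range(n0, k, s_d))
    return out
```

With k = 3, S_D = 2, the window [[1, 2], [3, 4]], and the filter [[1, 2, 3], [4, 5, 6], [7, 8, 9]], the four outputs match a direct convolution computed over the zero-up-sampled ifmap.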
It is important to highlight that the remapping strategy proposed here is a different point of view on the methods based on filter decomposition [11,13–15]. Indeed, while the latter re-arrange the filter coefficients to perform the proper element-wise multiplications, the former re-arranges the input activations. However, as discussed in Section 5, the proposed strategy is more efficient from the hardware perspective, because it allows online computations and does not require complex architectures to manage the remapping.
To better explain the novel computational scheme, let us consider the example in Figure 3, which refers to k = 9, S_D = 2, and K_C = 5. In this case, the local row and column indices m and n vary from 0 to 8. Therefore, for each input pixel I_{i,j}, the above-explained basic rules lead to the remapped 9 × 9 window visible in Figure 3a, where the 5 × 5 elements of the original sliding window are highlighted in blue. It can be observed that the remapped window collects all of the data needed to compute the results O_{i×S_D+p, j×S_D+q} simultaneously, with the indices p and q, used to locate the produced results within the ofmap, ranging between 0 and 1. Indeed, since S_D = 2, the results O_{i×2, j×2}, O_{i×2, j×2+1}, O_{i×2+1, j×2}, and O_{i×2+1, j×2+1} are computed as given in Equation (1).
As expected, the results O_{i×2+p, j×2+q} corresponding to p and/or q greater than zero are obtained by accumulating fewer than K_C × K_C products, and the missing products are simply ignored, since they are related to multiplications by zero.
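For the k = 9, S_D = 2 example, the number of products actually accumulated per output position can be enumerated, as a quick check of the "fewer than K_C × K_C products" observation (under our reading of the stride-S_D accumulation, with assumed start offsets (S_D − p) mod S_D and (S_D − q) mod S_D):

```python
def product_counts(k, s_d):
    # Count the PP terms that survive for each output phase (p, q)
    counts = {}
    for p in range(s_d):
        for q in range(s_d):
            m0, n0 = (s_d - p) % s_d, (s_d - q) % s_d
            counts[(p, q)] = len(range(m0, k, s_d)) * len(range(n0, k, s_d))
    return counts
```

Here product_counts(9, 2) gives 25, 20, 20, and 16 products for the four phases, so only O_{i×2, j×2} uses all K_C × K_C = 25 products.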
The computations described above are repeated for each pixel of the ifmap and, upon completion, the H_i × W_i groups of S_D × S_D results obtained in this way are arranged in the ofmap, as illustrated in Figure 4. In the figure, different colors are used to highlight each group of S_D × S_D results computed in parallel.

It is worth noting that, when S_D is 1, K_C is equal to k and the sliding window does not require remapping operations; in such a case, the proposed algorithm performs a standard CONV. With the input volume consisting of M ifmaps, all of the computations described above must be repeated M times. The M intermediate ofmaps computed in this way are summed up to populate the volume of the expected N ofmaps.
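The channel loop just described can be sketched as follows. Here single_channel_op stands for any per-ifmap CONV/TCONV computation (the name and the interface are ours, for illustration only):

```python
def layer(ifmaps, filters, single_channel_op):
    """ofmap[h] = pixel-wise sum over the M ifmaps of single_channel_op(ifmap, kernel)."""
    ofmaps = []
    for filt in filters:                        # N filters -> N ofmaps
        acc = None
        for ifmap, kernel in zip(ifmaps, filt): # M kernels per filter
            partial = single_channel_op(ifmap, kernel)
            if acc is None:
                acc = partial
            else:                               # pixel-wise accumulation
                acc = [[a + b for a, b in zip(ra, rb)]
                       for ra, rb in zip(acc, partial)]
        ofmaps.append(acc)
    return ofmaps
```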

The Proposed Run-Time Reconfigurable Hardware Accelerator
The novel method presented above to convert TCONVs into CONVs is employed within a reconfigurable hardware structure purposely designed to perform both CONVs and TCONVs by adapting itself at run-time to different operating modes.
In order to achieve high computational speeds, the proposed hardware accelerator exploits a certain level of parallelism. In the following, it is shown that T_M ifmaps and T_N filters are processed at a time, with T_M and T_N varying at run-time in accordance with the current operating mode, the kernel size k, and the up-sampling factor S_D. For the operations of the generic layer to be completed, regardless of whether it is a CONV or a TCONV layer, ⌈M/T_M⌉ × ⌈N/T_N⌉ steps are needed. Figure 5 depicts the top-level architecture of the proposed hardware accelerator, which consists of a computational module (CM) and a finite state machine (FSM). The former receives, as inputs, T_M ifmaps and T_N filters, each consisting of T_M kernels collecting k × k coefficients, and provides T_N ofmaps at a time. Conversely, the FSM is fed with the input configuration, which sets the required operating mode (indicating whether CONVs or TCONVs must be performed), the kernel size k, the fmap sizes, and the window size K_C, and furnishes the proper control/configuration signals to the CM. Through these signals, the FSM configures the CM and supervises the overall data flow. The CM splits the incoming T_N filters into R groups and employs as many CONV/TCONV units (CTCUs). Each CTCU, depending on the received control and configuration signals, arranges the data in proper sliding windows and executes either CONVs or TCONVs by processing the T_M ifmaps and its own ⌈T_N/R⌉ filters.
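Since only T_M ifmaps and T_N filters are resident at a time, the per-layer scheduling cost is ⌈M/T_M⌉ × ⌈N/T_N⌉ steps; as a trivial sketch (the example M, N, T_M, T_N values below are hypothetical):

```python
import math

def layer_steps(m, n, t_m, t_n):
    # Steps needed when T_M ifmaps and T_N filters are processed at a time
    return math.ceil(m / t_m) * math.ceil(n / t_n)
```

For example, with M = 56, N = 12, T_M = 8, and T_N = 4, the layer requires 7 × 3 = 21 steps.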
The results provided by the CTCUs are then dispatched to the subsequent modules through the routing logic, purposely designed to take into account that the supported operating modes lead to different data flows. In fact, depending on whether CONVs or TCONVs are performed, the intermediate results related to the current T_M input channels must be accumulated by the proper adder trees (ATs). Then, the data must be routed either to the ofmap buffers, which happens when the computation of the current T_N ofmaps is not yet completed, or to the parametric rectified linear units (PReLUs) that implement the linear rectification method demonstrated in [24].
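The PReLU activation from [24] applied at the end of the pipeline is simply f(y) = y for y ≥ 0 and f(y) = a·y otherwise, with a learned slope coefficient a; as a behavioral sketch:

```python
def prelu(y, a):
    # Parametric ReLU [24]: negative inputs are scaled by the learned slope a
    return y if y >= 0 else a * y
```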
The generic CTCU is structured as illustrated in Figure 6. The ifmaps buffer (IFB) and the weights buffer (WB) collect, respectively, the N_A-bit pixels of the incoming T_M ifmaps and the N_W-bit coefficients of the received ⌈T_N/R⌉ filters. In particular, the IFB circuit is responsible for arranging the K_C × K_C-sized sliding windows that will be processed through the proposed algorithm. When TCONVs are executed, the remap unit (RU) performs the first step of the proposed approach. It implements the novel logic discussed in Section 3 to remap the T_M K_C × K_C sliding windows into as many k × k windows. The ⌈T_N/R⌉ CONV/TCONV engines (CTCEs) execute the element-wise multiplications and the accumulations (steps 2 and 3 in Figure 2a); they receive the T_M remapped windows and the filter coefficients as arranged, in the meantime, by the WB. When CONVs are executed with kernel sizes greater than 1, the RU is bypassed; thus, the IFB and the WB feed the CTCEs directly. In the case of 1 × 1 CONVs, both the IFB and the RU are bypassed, thus inputting the ifmaps directly to the CTCEs.
While the WB uses just simple N_W-bit shift registers, as shown in Figure 7, the IFB consists of three main parts:

• The register window (RW), composed of K_M × K_M N_A-bit registers, with K_M being set to T_M × k × k, thus ensuring that up to T_M k × k sliding windows can be accommodated at a time. The sparse multiplexing logic visible in Figure 7 guarantees that the used registers are properly cascaded according to the current value of k.

• The line shift buffer, used to locally store Wi − k pixels of k − 1 rows of each received ifmap, and to perform the shift operations conventionally required to properly accommodate the sliding windows during the overall computation.

• The padding logic, used to establish whether the current sliding windows must be zero-padded, which occurs when the current anchor points are associated with the bordering pixels of the processed ifmaps.
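The cooperation of the register window, the line shift buffer, and the padding logic can be mimicked in software. The following Python sketch is a behavioral model only, under our own simplifying assumptions (single ifmap, unit stride, odd k; the function name and structure are ours, not the RTL): it zero-pads the borders exactly where the padding logic would, and yields one k × k window per anchor point.

```python
import numpy as np

def sliding_windows(ifmap, k):
    """Behavioral model of the IFB for a single H x W ifmap: yield one
    k x k window per anchor point, zero-padding the borders as the
    padding logic does when the anchor is a bordering pixel.
    Assumes odd k and unit stride (illustrative simplifications)."""
    h, w = ifmap.shape
    pad = k // 2
    # The line shift buffer keeps k - 1 previous rows on chip; in this
    # software model, padding the whole map plays the same role.
    padded = np.pad(ifmap, pad, mode="constant", constant_values=0)
    for i in range(h):
        for j in range(w):
            yield padded[i:i + k, j:j + k]
```

For a 4 × 4 ifmap and k = 3, the generator produces 16 windows; the first one has its top row and left column zeroed by the padding logic, since its anchor is the top-left bordering pixel.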
Within the CTCE, multiplications and accumulations are performed through two different pipelined sub-circuits, here named Type-A (TA) and Type-B (TB). As visible in Figure 8, each tile consists of several processing elements (PEs). The PEs inside the TAs execute MACs, whereas the PEs within the TBs perform two-operand additions. In order to provide a flexible architecture, suitable for performing both CONVs and TCONVs under different operating conditions, the CTCE exploits several TA and TB circuits, which are connected to one another by multiplexers. The latter allow a specific path within the CTCE to be activated, depending on the currently processed kernel size. Taking into account that, as observed in the previous sections, for a given kernel size TCONVs are more complex than CONVs, the TA and TB sub-circuits have been organized to comply with the computational capability required by TCONVs in the worst case, thus being intrinsically able to also satisfy the computational requirements of CONVs. As an example, Figure 9 illustrates the design of the CTCE when it has to comply with 9 × 9 TCONVs at SD = 2. In this case, 13 TAs and eight TBs are properly arranged to accomplish steps 2 and 3 of the proposed method. The TAs, consisting of 81 PEs, exploit as many multipliers to execute the element-wise matrix multiplication (step 2).
Accumulators internal to the TAs, in conjunction with the 12 PEs provided by the TBs, perform the chessboard accumulations (step 3) to furnish the parallel results as in Equation (1). In Figure 9, the SD × SD parallel outputs are labeled 5 × 5_r0, 5 × 4_r, 4 × 5_r, and 4 × 4_r, respectively. Subsequently, the external module ATs for TCONVs (visible in Figure 5) sums these outputs with the homologous results furnished by the other CTCEs operating in parallel. In addition, both the TAs and the TBs can be used to perform different CONVs, as follows:
1. Twelve 1 × 1 CONVs, whose results are 1 × 1_ru, with u = 0, ..., 11;
2. Nine 3 × 3 CONVs, with the furnished results being 3 × 3_rx, with x = 0, ..., 8;
3. Two 5 × 5 CONVs, whose results are 5 × 5_r0 and 5 × 5_r1;
4. One 7 × 7 CONV; in this case, the results 5 × 5_r0 and 5 × 5_r1 are added by the external module ATs for CONVs;
5. One 9 × 9 CONV.
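The chessboard organization can be cross-checked with a short calculation. The Python sketch below is our own illustration, not taken from the paper: for a KC × KC TCONV kernel at stride SD, each of the SD × SD output phases only ever multiplies the kernel rows and columns of one residue class, and counting those residues for KC = 9 and SD = 2 yields exactly the 5 × 5, 5 × 4, 4 × 5, and 4 × 4 groups of Figure 9, totaling the 81 multipliers of the TAs.

```python
def subkernel_shapes(kc, sd):
    """For a kc x kc TCONV kernel at stride sd, count how many kernel
    rows and columns fall in each residue class modulo sd: each of the
    sd x sd output phases only ever uses one (row, column) class, so
    these counts are the sub-window sizes accumulated per phase."""
    rows = [sum(1 for r in range(kc) if r % sd == p) for p in range(sd)]
    cols = [sum(1 for c in range(kc) if c % sd == q) for q in range(sd)]
    return [(rh, cw) for rh in rows for cw in cols]
```

Here `subkernel_shapes(9, 2)` returns `[(5, 5), (5, 4), (4, 5), (4, 4)]`, whose element counts sum to 81, regardless of how the residues are assigned to output phases.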
Depending on which operation must currently be performed (i.e., CONVs or TCONVs) and on the filter size k, the auxiliary multiplexing logic also depicted in Figure 9 coordinates the cooperation between the TAs and the TBs and guarantees that the different supported operations are performed correctly. The gray boxes represent the pipeline stages, whose depths are indicated by the reported numbers; they time-align the performed computations.
It is worth noting that, in order to make the above-described CTCE able to support different up-sampling factors, just a few simple modifications are required, either on the viable paths or on the composition of the TA and TB sub-circuits.
In order to explain the rest of the elaboration, let us refer to Figure 6 and suppose that the first computational step, related to the first TM ifmaps, has just been completed with the delivery of the first TN intermediate ofmaps, as provided either by the module ATs for TCONVs or by the module ATs for CONVs. At this point, such intermediate ofmaps are locally stored in the ofmaps buffer, waiting to be accumulated with the TN intermediate ofmaps that will be produced at the next step. The accumulation results are again locally stored in the buffer for the subsequent accumulations, and the operations go on in this way until the execution of the (M/TM)-th step takes place, thus furnishing the final TN ofmaps. Before being transferred to an external data memory, the latter are rectified by the PReLU units implementing the linear rectification approach demonstrated in [24]. All of the operations described above are executed N/TN times, i.e., until all the N final ofmaps are computed.
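The stepwise accumulation described above can be sketched as follows. This is a behavioral Python model under our own simplifying assumptions: M is a multiple of TM, and partial(c) stands for the intermediate ofmap contributed by input channel c, however the CTCUs compute it.

```python
import numpy as np

def tiled_channel_accumulation(partial, M, TM):
    """Behavioral sketch of the ofmaps buffer: each step adds the
    contributions of TM input channels to the locally stored result,
    over M // TM steps, matching the direct sum over all M channels.
    Assumes M is a multiple of TM (illustrative simplification)."""
    ofmaps_buffer = None  # stands in for the ofmaps buffer of Figure 6
    for step in range(M // TM):
        step_sum = sum(partial(c) for c in range(step * TM, (step + 1) * TM))
        ofmaps_buffer = step_sum if ofmaps_buffer is None else ofmaps_buffer + step_sum
    return ofmaps_buffer
```

With partial(c) returning, say, a 2 × 2 map filled with the value c, M = 8 and TM = 2, the buffered result equals the direct sum over the eight channels, which is the equivalence the stepwise scheme relies on.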

Experimental Results and Comparisons
As a case study, the real context of CNN-based SR imaging was referred to, and the proposed approach was adopted to accelerate the popular FSRCNN model [10]. For this purpose, the hardware architecture described in the previous section was tailored to comply with the configurations summarized, layer by layer, in Table 1. Here, M and N refer to the number of ifmaps and ofmaps, k and SD are the kernel size and the up-sampling factor, and TM and TN are the number of ifmaps and ofmaps processed in parallel. It is worth noting that how many instances of the CTCU module are used, i.e., the value of R, is established at design time to achieve a better trade-off between speed performance and area occupancy. For the referred case study, R = 12 was chosen, since it complies well with the requirements of the overall network model and reduces the inference time by more than 90% with respect to the case in which R = 1. Table 1 also reports the parameter PN, which indicates how many output values are computed in parallel for each of the TN furnished ofmaps. When the TCONV layer is executed, PN equals SD × SD, with SD being set to 2, 3, or 4, as established at design time. The parameters M, N, k, and SD are elaborated by the FSM that: (1) configures the proposed hardware accelerator at run-time, thus ensuring that TM and TN change properly as required by each layer; and (2) scans the various computational steps. The novel accelerator exploits fixed-point arithmetic with activations and filters quantized, respectively, to 16 and 10 bits. Such a choice, which arises from a preliminary analysis conducted to evaluate the impact of different quantization levels on the quality of reconstructed images, improves the area occupancy by 60% and 18% with respect to 32- and 16-bit fixed-point versions, respectively, with negligible detrimental effects on the quality of reconstructed images.
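The effect of the chosen word lengths can be reproduced with a simple fixed-point rounding model. The Python sketch below is illustrative only: the text does not state the integer/fraction splits, so the 8 fractional bits assumed here are hypothetical.

```python
import numpy as np

def quantize(x, total_bits, frac_bits):
    """Round to a signed fixed-point grid: step = 2**-frac_bits, values
    clipped to the range representable with the given word length."""
    step = 2.0 ** -frac_bits
    lo = -(2 ** (total_bits - 1)) * step
    hi = (2 ** (total_bits - 1) - 1) * step
    return np.clip(np.round(x / step) * step, lo, hi)

# Hypothetical splits (not from the paper): 16-bit activations and
# 10-bit weights, both with 8 fractional bits.
acts = quantize(np.random.randn(64), 16, 8)
wgts = quantize(np.random.randn(64) * 0.1, 10, 8)
```

Running the target model with such a routine applied to activations and weights is how the impact of each candidate word length on PSNR/SSIM can be screened before committing to the hardware datapath widths.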
Three different versions of the novel accelerator, each performing the TCONV layer with a specific up-sampling factor, were designed by using the very high-speed integrated circuits hardware description language (VHDL) at the register transfer level of abstraction. Experimental tests were performed using the Xilinx ZCU102 development board [25], experiencing a frame rate of 192.3 fps when 256 × 256 input images were processed. Implementation results, obtained utilizing the Xilinx XC7K410T and XCZU9EG FPGA devices and the 2019.2 Vivado Design Suite, are collected in Table 2, which reports:
- The amount of occupied look-up tables (LUTs), flip-flops (FFs), blocks of random access memory (BRAMs), and digital signal processing slices (DSPs);
- The power consumption, estimated through the switching activity interchange format (SAIF) file that, referring to several benchmark images, takes into account the real activities of all nodes within the analyzed circuit;
- The speed performance, evaluated in terms of the maximum running frequency and the giga operations per second (GOPS), i.e., the ratio between the overall computational complexity of the referred model and the inference time;
- The energy efficiency (GOPS/W), defined as the ratio between the GOPS and the power consumption.
Table 2 also summarizes the implementation characteristics of representative state-of-the-art FPGA-based designs that, being devoted to the acceleration of CNNs for SR imaging, have been selected as direct competitors, even though they refer to somewhat different models from the original FSRCNN presented in [10].
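The speed and efficiency figures in Table 2 follow from simple ratios, which the short Python example below makes explicit. The workload, time, and power values used are invented for the demonstration; only the definitions (GOPS as complexity over inference time, energy efficiency as GOPS per watt) come from the text.

```python
def gops(total_ops, inference_time_s):
    """Giga operations per second: model complexity over inference time."""
    return total_ops / inference_time_s / 1e9

def energy_efficiency(gops_value, power_w):
    """Energy efficiency in GOPS per watt."""
    return gops_value / power_w

# Invented example: a 10.5 GOP model inferred in 10 ms at 5 W.
g = gops(10.5e9, 0.010)          # 1050.0 GOPS
eff = energy_efficiency(g, 5.0)  # 210.0 GOPS/W
```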
As an example, while the designs proposed here were characterized referring to the whole model reported in Table 1, thus performing four cascaded CONV layers with k = 3 (i.e., layers 3, 4, 5, and 6), the accelerators presented in [11,15,17] refer to simplified models and perform only one CONV layer with k = 3. As a further simplification, to relieve the computational load, the design described in [17] replaces the TCONV with an efficient sub-pixel CONV (ESPCN) layer that provides up-sampled ofmaps through a periodic shuffling [26]. Conversely, the reconfigurable design presented in [13] refers to the original FSRCNN model, but it performs CONVs with kernel sizes ranging from 1 × 1 to 4 × 4 and changes the TCONV kernel size from 9 × 9 to 8 × 8.
In order to point out the main differences between the network models accelerated by the compared designs, they are referenced in Table 2 as FSRCNN(x,y,z,w). There, x, y, z, and w are, respectively, the number of ofmaps outputted by the first CONV layer, the number of ofmaps furnished by the subsequent CONV layers (except the last), the number of cascaded CONV layers with kernel size k = 3, and the TCONV kernel size.
By examining the results summarized in Table 2, it can be observed that, although referring to the most complex CNN model, the proposed accelerators lead to lower power consumption, thanks to their particularly efficient flexible architecture. The power savings achieved with respect to [11,17] come from the capability of the proposed designs to adapt at run-time to different CONV kernel sizes. Without such a capability, the implementations characterized in [11,15,17] must employ a different ad-hoc architecture for each layer, thus negatively affecting the power consumption and the resource requirements.
In comparison to [11], the proposed XC7K410T-based implementations save more than 53.7% LUTs, 22.3% FFs, and 14.3% DSPs, and improve the energy efficiency by up to 25.5%, which is also the result of avoiding multiplications with sparse filters, as required by [11]. These advantages are obtained even though the CNN model referenced in the novel designs is more complex than that of [11], which instead benefits from the reduced model complexity in terms of GOPS. Table 2 clearly shows that the design demonstrated in [13] is particularly efficient in terms of occupied hardware resources. Nevertheless, when compared to [13], the novel accelerators implemented on the XCZU9EG chip consume ~3% less power and achieve up to ~16% higher GOPS, although they perform CONVs and TCONVs with greater kernel sizes and coefficient bit widths.
The accelerator presented in [15] sacrifices a certain amount of hardware resources to implement a very deep pipeline, thus reaching the highest GOPS. However, such an advantage is obtained to the detriment of occupied LUTs, as a consequence of the Winograd algorithm implementation: in comparison to the proposed accelerator, for the same implementation chip and SD, [15] performs ~9.5% more GOPS, but the amount of occupied LUTs is ~2.8 times higher.
Finally, from Table 2, it can be seen that, despite the simplifications introduced to reduce the computational complexity of the referred CNN model, for the same up-sampling factor SD = 2, the design proposed in [17] occupies ~48.9% and ~77% more LUTs and DSPs than the novel accelerator targeting the XC7K410T chip. Furthermore, the design presented here exhibits considerably improved speed performance and power consumption, which lead to a ~2.3 times higher energy efficiency.
For the sake of a fair analysis, the FSRCNN models referenced in Table 2 were also compared in terms of the quality achieved at different up-scaling factors.
Software routines modeling the proposed accelerators were purposely written to process the popular Set-5, Set-14, and B100 datasets and to evaluate the peak signal-to-noise ratio (PSNR) and the structural similarity (SSIM) [27]. Table 3 clearly shows that the strategy adopted here to transform TCONVs into CONVs does not affect the quality of reconstructed images. Indeed, in most of the analyzed cases, slightly improved PSNR and SSIM were achieved with respect to [11,13,17]. Furthermore, the small quality loss experienced in a few cases is well compensated by the benefits offered by the proposed method over its competitors in terms of some implementation characteristics. It is worth noting that the counterpart [15] was not included in the comparisons because the quality metrics furnished in the original paper relate to quite different datasets.
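The PSNR metric used in these comparisons can be computed as in the minimal Python sketch below, assuming 8-bit images (peak value 255); SSIM, being a windowed statistic, is normally taken from a library such as scikit-image and is omitted here.

```python
import numpy as np

def psnr(ref, rec, peak=255.0):
    """Peak signal-to-noise ratio in dB between a reference image and
    its reconstruction, both on the same [0, peak] scale."""
    mse = np.mean((np.asarray(ref, float) - np.asarray(rec, float)) ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return 10.0 * np.log10(peak ** 2 / mse)
```

Averaging this value over a dataset such as Set-5 gives the per-dataset figures reported in comparisons like Table 3.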
Finally, Figure 10 shows a sample image from the Set-5 dataset that was up-sampled by using the proposed approach at SD = 2. As expected, the details were well reconstructed and, in this case, the achieved PSNR was 31.48 dB.

Conclusions
This paper presented an efficient hardware-oriented algorithm suitable to comply with the computational requirements of both the CONV and TCONV layers of many popular CNNs. The proposed approach was implemented by a flexible hardware architecture able to adapt itself at run-time to different operating modes at various kernel and fmap sizes. In contrast to state-of-the-art counterparts, the novel strategy adopted here to transform TCONVs into CONVs requires neither pre-processing stages nor offline kernel decompositions. Indeed, it exploits a simple reorganization of the sliding windows picked up from the incoming ifmaps. The capability of supporting different operating conditions and the simplicity of the remapping strategy led to reconfigurable hardware designs characterized by low power consumption, high-speed performance, and parsimonious utilization of logic resources.
In order to demonstrate the efficiency of the proposed approach, a fast super-resolution CNN was referenced as a case study. Three versions of the novel reconfigurable hardware accelerator were implemented, each supporting a specific up-sampling factor. The characterization results obtained using the Xilinx XC7K410T FPGA device demonstrated that, although they refer to more complex CNN models, the proposed designs consume less power than their counterparts, occupying from 1.5 to 2.7 times fewer LUTs and exhibiting an energy efficiency from 1.1 to 2.3 times higher. The tests performed on several datasets also demonstrated that the above advantages are achieved without compromising either the PSNR or the SSIM quality metrics.

Conflicts of Interest:
The authors declare no conflict of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.

Figure A1 details how the proposed approach processes the generic KC × KC sliding window with the pixel Ii,j in the top-left position.