^{1}

^{*}

^{2}

This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution license (http://creativecommons.org/licenses/by/3.0/).

A high-speed dual super-systolic core for reconstructive signal processing (SP) operations consists of a double parallel systolic array (SA) machine in which each processing element of the array is also conceptualized as another SA in a bit-level fashion. In this study, we addressed the design of a high-speed dual super-systolic array (SSA) core for the enhancement/reconstruction of remote sensing (RS) imaging of radar/synthetic aperture radar (SAR) sensor systems. The selected reconstructive SP algorithms are efficiently transformed in their parallel representation and then, they are mapped into an efficient high performance embedded computing (HPEC) architecture in reconfigurable Xilinx field programmable gate array (FPGA) platforms. As an implementation test case, the proposed approach was aggregated in a HW/SW co-design scheme in order to solve the nonlinear ill-posed inverse problem of nonparametric estimation of the power spatial spectrum pattern (SSP) from a remotely sensed scene. We show how such dual SSA core, drastically reduces the computational load of complex RS regularization techniques achieving the required real-time operational mode.

Advances in digital signal processing have permeated many applications, providing unprecedented growth in capabilities. Complex sensors systems are computationally extremely expensive and the majority of them are not suitable for a real-time implementation [

Moreover, a mix of dedicated hardware solutions and programmable devices is found in applications for which no other approach can meet their real-time performance demands. Additionally, many hyperspectral imaging applications require a response in real-time in several areas for example, environmental modeling and assessment, target detection for military and homeland defense/security purposes, and risk prevention and response. Even though newer microprocessors can operate at several GHz in speed providing a maximum throughput in the gigaflops class, several contemporary applications such as space systems, airborne systems, missile seekers, tracking wildfires, detecting biological threats, and monitoring oil spills, to name a few, must rely on a combination of dedicated hardware and programmable embedded systems. As a result of the growth in the requisite parallel data processing, high-performance embedded computing (HPEC) architectures represent a real possible solution in order to achieve a high performance in real time implementations, especially in those applications that include complex RS reconstructive operations.

The main contribution of this paper is the design of an efficient high-speed dual super-systolic array (SSA) core for reconstructive signal processing (SP) operations of RS algorithms via the use of HPEC techniques. With the design of a bit-level dual SSA core architecture, the computational load of complex RS algorithms can be drastically reduced pursuing the required real-time operational mode for newer Geospatial applications.

The strategy for the implementation of a bit-level dual super-systolic architecture in a Xilinx Virtex field programmable gate array (FPGA) platform is aimed at enhancing the locality through the utilization of HPEC techniques to precisely represent loop programs and computing complex sequences of loop transformations (interchange, skewing, tiling,

Therefore, although there are some recently developed studies related to the implementation of RS applications [

Finally in Section 4.2 of this study, a test case study of how the dual SSA core drastically reduces the processing time of a high-resolution regularization technique for the enhancement/reconstruction of real world SAR images is presented. Such hardware implementation results illustrate the usefulness for the development of system-level optimization of high-resolution image enhancement tasks performed with real-world RS imagery. The authors believe that the proposed high-speed dual super-systolic core architecture is

The term remote sensing (RS) is used to describe the science of identifying, observing, and measuring an object without coming into direct contact with it. This process involves the detection and measurement of different wavelength radiations, reflected or emitted from distant objects or materials, by which they may be identified and categorized by class, type, substance, and spatial distribution. RS systems are thus made of sensors mounted on an aircraft or a spacecraft that gather information from the Earth’s surface. Synthetic Aperture Radar (SAR) is an array of active sensors, and it is widely used in remote sensing missions to achieve high-resolution Earth images. In recent years, several efforts have been directed towards the incorporation of high-performance computing (HPC) models to remote sensing missions. Moreover, advances in sensor technology are revolutionizing the way remotely sensed data are collected, managed, and analyzed. In particular, many current and future applications of remote sensing in earth science, space science, and soon in exploration science require real- or near-real-time processing capabilities. In

In this study, we also propose a design methodology for real time implementation of specialized arrays of processors in a high performance embedded computing (HPEC) architecture. This architecture is based on a dual super-systolic array core as coprocessors unit that is integrated in a MPSoC platform via a HW/SW co-design paradigm.

This approach represents a real possibility for high-speed reconstructive signal processing (SP) tasks for the enhancement/reconstruction of RS imagery. In addition, the authors believe that the FPGA/DSP-based systems in aggregation with novel bit-level super-systolic architectures are emerging as newer solutions which offer enormous computation potential in RS systems.

In this sub-section, we describe the HW/SW co-design methodology implemented in this study. The HW/SW co-design is a hybrid method aimed at increasing the flexibility of the implementation and improvement of the overall design process [

The HW/SW co-design methodology encompasses the following general stages:

Algorithmic implementation (reference simulation in MATLAB and C++ platforms);

Computational partitioning process;

Architecture design procedure using HPEC techniques.

From the analysis of the HW/SW co-design methodology, one can deduce that the RS algorithm is first adapted in a co-design scheme applying HPEC techniques, and then, the selected computationally complex reconstructive operations are efficiently implemented in bit-level high-throughput accelerator architectures.

In this sub-section, the procedure for the computational implementation of the RS-related regularization algorithms using MATLAB and C++ platforms is developed. With these algorithmic analyses, the effectiveness of the model employed in the HW/SW co-design is verified.

All the numerical test sequences are generated with the Fixed Point Toolbox [

Now, we briefly describe a family of previously developed nonparametric high-resolution RS imaging techniques [

Examples of different RS imaging methods are the following: the Constrained Least Squares (CLS) and the Weighted CLS, which are deterministic methods that incorporate partial error functions into the corresponding objective costs [

The above implementation schemes are optimized in order to solve RS imaging problems, stated as follows: the scene pixel-frame image _{(j)}_{u}_{(j)}_{diag} defines the vector composed of the principal diagonal of the embraced matrix, ⊙ defines the Shur-Hadamar (element by element) product and ^{(}^{p}^{)} represents the reconstructive/enhancement regularization technique, respectively.

To optimize the search of the SFO

Having established the optimal reconstructive RS estimators, let us now consider the way in which the processing of the data vector

First, the point spread matrix (PSMs) [

Second, the Shur-Hadamar operation and the parallel reconstructive/enhancement operations of ^{(p)}

In this subsection, it is presented how to perform an efficient HW/SW partitioning of the computational tasks. The aim of the partitioning problem is to find which computational tasks can be implemented in an efficient hardware architecture looking for the best trade-offs among the different solutions [

The proposed partitioning stage is clearly influenced by the target architecture onto which the HW and the SW will be mapped. We begin with the specifications of the system-level partitioning functions and detailing the selected design quality attributes for the HW/SW co-design aimed at the definition of the computational tasks that can be implemented in the dual super-systolic core form, namely: hardware area (

The system must always satisfy the constraints: 0 ≤ _{i}, i_{i}_{i}

Each block implementation {_{i}_{i}_{i}_{E}_{i}_{E}

Now, the HW/SW co-design system architecture is to be optimized via bounding the total expected system processing time _{i}_{i}_{C++}, represents the execution time required for implementing the corresponding RS-related regularization algorithms in the standard C++ computational environment.

Note that from the formal SW-level co-design point of view, such RS-regularized techniques,

Following the presented above partitioning paradigm, one can now decompose the fixed-point RS-regularized algorithms developed at the SW-design into the DSP/embedded processor and the specialized high-speed hardware accelerators _{i}, i

In the design, the SSAs require high data bandwidth of data exchange with the DSP/embedded processor. Another challenging task of the co-design is how to manage the large block of data avoiding unnecessary data transfers from/to the embedded processor to/from the proposed bit-level HW accelerator.

The main parameters to consider in the partitioning stage are the task execution speed and the area required by its HW-level implementation. Based on those parameter considerations, the HW/SW co-design is carried out, which consists in deciding which tasks should be executed in SW and which one should be implemented by HW. Additionally, a number of different loop optimization techniques (^{−5} referring to the MATLAB Fixed Point Toolbox [

The super-systolic array (SSA) is a generalization of the systolic array (SA). It is a specialized form of an architecture, where the cells (

The first stage of the SSA-based design flow of

In

Both SSA architectures perform the discrete-form representation of the desired spatial spectrum pattern (SSP) in a high-performance structure. The multiply-accumulate (MAC) operation implemented in each processing element (PE) is now depicted in

The internal structure of each PE presented in

The algorithm of the selected RS-reconstructive operations, ^{+} can be represented by nested loops or FOR-loops programs. First, let us define from the reconstructive RS estimator of _{ji}

Next, the localization method converts the algorithm into an algorithmic representation with local and regular dependencies [

Once the algorithm is transformed into their localized representation (

The tiling technique is a well known loop transformation used to automatically create sub-block algorithms [

Now, considering the locally recursive representation presented in

The second step of the tiling procedure consists in implement the loop permutation transformation based on the Polytope model [^{T} into the required new index-space _{P}^{T}=[^{T}. In this step, it is defined the Polytope model as a set of inequations such as _{P}_{P}^{T} represents the index-space after the

The source Polytope is described in a convex form by a set of half-spaces, where the intersection of all half-spaces corresponds to the Polytope and the target representation is presented as follows:

The new loop bounds are derived with the Fourier-Motzkin elimination algorithm. The resulting reconstructive RS algorithm after the loop permutation transformation is now presented, in which the proper substitutions are integrated:

The third step of the tiling procedure corresponds again to the strip-mining transformation procedure but in this case, the procedure is applied over the

From the analysis of the tiled parallel algorithm of

The space-time mapping procedure onto SAs is a technique that transforms an index-space representation into a space-time representation where each node of their iteration node is mapped to a certain PE and it is scheduled to a certain instance of time [_{m}^{N}^{N−1} maps the ^{N}^{N−1}), where _{m}_{m}

With these specifications, the transformation matrix becomes

The number of PEs required by each coarse grain SA of the dual SSA core architecture is

Once the coarse grain SAs of the selected RS reconstructive algorithm have been defined, we are ready to conceptualize and implement the bit-level fixed-sized dual SSA core. The internal structure of each fixed-sized SA contains identical PEs linearly-connected also in a systolic fashion. The same SA-based design flow, implemented in the previous sub-sections (

From the analysis of

The SA for performing the bit-level MAC operation of each PE of the dual SSA core employs the following specifications in the transformation defined by

This bit-level SA-based MAC architecture requires an array of

In this section, the results of the hardware-level implementation of the reconstructive complex RS functions with the employment of a high-speed fixed-sized dual SSA core accelerator are reported. The addressed architecture drastically reduce the computationally load of the enhancement/reconstruction real-world Geospatial images acquired with different fractional multisensory SAR systems. In order to demonstrate the best area-time trade-off of the digital implementation and the high accuracy of the proposed RS hardware accelerator, the authors have conducted some test-case scenarios of the real-world RS images characterized by the point spread function (PSF) of a Gaussian “bell” shape in both directions of the 2-D scene (in particular, of 16 pixel width at 0.5 from its maximum for the 1

The comparative performance analysis of the HW-level implementation of the dual SSA core architecture is presented in this sub-section. Such HW-level performance analysis is focused on define which is the best area-time tradeoff. The Xilinx XST tool of the Integrated Software Environment (ISE™) WebPACK™ was used for the synthesis of the proposed architecture. The clock frequency of 100 MHz is the selected timing constrain considered for the synthesis procedure. The following parameters were considered in the synthesis: the order of each fixed-sized SSA is

From the analysis of

Next, the scalability analysis in terms of HW resources is presented for the relevant dual SSA core architecture in

In such analysis, the number of precision bits and the number of processing elements (PEs) are modified in order to estimate the HW resources. The results reveal the area resource utilization of the dual SSA-based architecture in the FPGA-target platform.

Other alternative implementations for RS applications that implement specialized HW architectures (

In the next section, a concrete real-world Geospatial test application from multi-sensor array SAR systems scenario is presented. This test will evaluate the accuracy of the proposed dual SSA core that it is integrated in a MPSoC design via the HW/SW co-design. The reported results of such enhancement/reconstructive model will be also discussed further on in the next sub-section.

In this sub-section, we present an illustrative test case study related to the HW-implementation of the Weighted Constrain Least Square (WCLS) regularization technique for the enhancement/reconstruction of RS applications. This HW-implementation is based on the proposed here, dual SSA core in aggregation with a Microblaze embedded processor and the On Chip Peripheral Bus (OPB) for transferring the data to/from the embedded processor. In the HW design, we consider to use the precision of 32 bits fixed-point, 9-bit integer and 23-bits decimal for the implementation of all fixed-point operations in each SSA core. Once the HW/SW co-design methodology has been employed, we are ready to establish the verification statements to evaluate the accuracy of the MPSoC system.

In the HW implementation, a large scale (1_{r}_{a}

The quantitative measures of the image enhancement/reconstruction performance gains achieved with the particular employed WCLS technique, evaluated via the IOSNR metric, are reported in

Next, the qualitative results are presented in

From the analysis of the qualitative and quantitative implementation results reported in

Finally, in

In the first case in

The implementation of this high-speed architecture helps to drastically reduce the overall processing time. Particularly, the proposed implementation of the WCLS algorithm with the proposed HW-specialized architecture takes only 0.81 s for the large-scale RS image reconstruction in contrast to 12.6 s required with the C++ reference implementation. Thus, the processing time of the proposed dual SSA core-oriented architecture is approximately 16 times less than the corresponding processing time achievable with the conventional C++ PC-based implementation.

In this regard, the emergence of specialized hardware devices such as the dual SSA core in FPGAs represents a new paradigm to develop real-time systems for remote sensing data processing. The increasing computational demands of remote sensing applications can now be benefit from these compact hardware components, taking advantage of the small size and relatively low cost of these units as compared to clusters or networks of computers. These aspects are of great importance in other areas like hyperspectral imaging for Earth observation and remote sensing missions that require an extremely large number of spectral bands and high spatial resolution.

In this study, a high-speed dual SSAs core which accelerates complex regularization operations for the real-time enhancement/reconstruction of large-scale RS imaging of radar/SAR sensor systems is presented. Also, the design methodology for real time implementation of such specialized arrays of processors using high performance embedded computing (HPEC) architecture was developed.

The dual SSA core was evaluated as follows: the architecture was aggregated with a Microblaze embedded processor in MPSoC structure via the HW/SW co-design paradigm. The WCLS regularization method was algorithmically adapted (using parallel computing techniques) and implemented in a real time computational mode (the ‘real-time’ being understood in a context of conventional RS users). The performance analysis and the qualitative and quantitative results reveal that the dual SSA core as accelerator units drastically increase the throughput and the processing time of large-scale real-time image processing requirements while performing the reconstruction of real-world hyperspectral RS imagery. In addition, the authors believe that the FPGA/DSP-based systems in aggregation with novel bit-level super-systolic architectures offer enormous computation potential in RS systems for newer Geospatial applications.

The study was supported by Consejo Nacional de Ciencia y Tecnología (México) under the grant CB-2010-01-158136.

MPSoC platform of RS algorithms via the HW/SW co-design paradigm.

MPSoC platform of RS algorithms via the HW/SW co-design paradigm.

Dual SSA core of the RS-related estimator.

MAC operation of each PE.

Bit-level SSA of the MAC structure.

HW-resource scalability analysis: (

Implementation results for the first observation scenario: (SNR μ = 10 dB): (

Implementation results for the second observation scenario: (SNR μ = 10 dB): (

Synthesis results of the proposed dual SSA core. SSA order:

| ||
---|---|---|

LUTs | 12,416 | 12,416 |

Slices | 10,624 | 20,480 |

Flip-Flops | 20,480 | 12,288 |

Output bit-width | 32 | 32 |

Max. Clock freq. (MHz) | 910.47 | 920.93 |

IOSNR of the WCLS algorithm evaluated for different SNRs.

_{a} |
_{a} | |||
---|---|---|---|---|

| ||||

^{(CLS)} [dB] |
^{(WCLS)} [dB] |
^{(CLS)} [dB] |
^{(WCLS)} [dB] | |

| ||||

5 | 2.12 | 3.26 | 2.67 | 3.92 |

10 | 3.43 | 4.45 | 4.59 | 5.83 |

15 | 4.17 | 5.23 | 5.51 | 7.64 |

20 | 5.36 | 6.82 | 6.47 | 9.87 |

25 | 6.94 | 8.27 | 8.32 | 11.16 |

Processing times required for implementing the WCLS algorithm.

PC-Oriented Implementation of the WCLS | 12.6 |

Implemented with the proposed dual SSA core architecture | 0.81 |