<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE article PUBLIC "-//NLM//DTD Journal Publishing DTD v2.3 20070202//EN" "journalpublishing.dtd">
<article xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink" xml:lang="en" article-type="research-article">
<front>
<journal-meta>
<journal-id journal-id-type="nlm-ta">Sensors</journal-id>
<journal-title>Sensors</journal-title>
<issn pub-type="epub">1424-8220</issn>
<publisher>
<publisher-name>Molecular Diversity Preservation International (MDPI)</publisher-name></publisher></journal-meta>
<article-meta>
<article-id pub-id-type="doi">10.3390/s120302539</article-id>
<article-id pub-id-type="publisher-id">sensors-12-02539</article-id>
<article-categories>
<subj-group>
<subject>Article</subject></subj-group></article-categories>
<title-group>
<article-title>Dual Super-Systolic Core for Real-Time Reconstructive Algorithms of High-Resolution Radar/SAR Imaging Systems</article-title></title-group>
<contrib-group>
<contrib contrib-type="author">
<name><surname>Atoche</surname><given-names>Alejandro Castillo</given-names></name><xref ref-type="aff" rid="af1-sensors-12-02539"><sup>1</sup></xref><xref ref-type="corresp" rid="c1-sensors-12-02539"><sup>*</sup></xref></contrib>
<contrib contrib-type="author">
<name><surname>Castillo</surname><given-names>Javier Vázquez</given-names></name><xref ref-type="aff" rid="af2-sensors-12-02539"><sup>2</sup></xref></contrib></contrib-group>
<aff id="af1-sensors-12-02539">
<label>1</label> Department of Mechatronics, Autonomous University of Yucatan, Av. Industrias No Contaminantes s/n, Cordemex, 97203, Merida, Yuc., Mexico</aff>
<aff id="af2-sensors-12-02539">
<label>2</label> Science and Engineering Division, University of Quintana Roo, Boulevard Bahia s/n, Chetumal, QRoo 77010, Mexico; E-Mail: <email>jvazquez@uqroo.mx</email></aff>
<author-notes>
<corresp id="c1-sensors-12-02539">
<label>*</label>Author to whom correspondence should be addressed; E-Mail: <email>acastill@uady.mx</email>; Tel.: +52-999-930-0550; Fax: +52-999-930-0559.</corresp></author-notes>
<pub-date pub-type="collection">
<month>3</month>
<year>2012</year></pub-date>
<pub-date pub-type="epub">
<day>24</day>
<month>2</month>
<year>2012</year></pub-date>
<volume>12</volume>
<issue>3</issue>
<fpage>2539</fpage>
<lpage>2560</lpage>
<history>
<date date-type="received">
<day>3</day>
<month>12</month>
<year>2011</year></date>
<date date-type="rev-recd">
<day>26</day>
<month>1</month>
<year>2012</year></date>
<date date-type="accepted">
<day>21</day>
<month>2</month>
<year>2012</year></date></history>
<permissions>
<copyright-statement>© 2012 by the authors; licensee MDPI, Basel, Switzerland</copyright-statement>
<copyright-year>2012</copyright-year>
<license>
<p>This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution license (http://creativecommons.org/licenses/by/3.0/).</p></license></permissions>
<abstract>
<p>A high-speed dual super-systolic core for reconstructive signal processing (SP) operations consists of a double parallel systolic array (SA) machine in which each processing element of the array is itself conceptualized as another SA in a bit-level fashion. In this study, we address the design of a high-speed dual super-systolic array (SSA) core for the enhancement/reconstruction of remote sensing (RS) imaging of radar/synthetic aperture radar (SAR) sensor systems. The selected reconstructive SP algorithms are transformed into their parallel representation and then mapped onto an efficient high performance embedded computing (HPEC) architecture in reconfigurable Xilinx field programmable gate array (FPGA) platforms. As an implementation test case, the proposed approach was aggregated in a HW/SW co-design scheme in order to solve the nonlinear ill-posed inverse problem of nonparametric estimation of the power spatial spectrum pattern (SSP) from a remotely sensed scene. We show how such a dual SSA core drastically reduces the computational load of complex RS regularization techniques, achieving the required real-time operational mode.</p></abstract>
<kwd-group>
<kwd>super-systolic</kwd>
<kwd>parallel computing</kwd>
<kwd>remote sensing</kwd>
<kwd>FPGA</kwd></kwd-group></article-meta></front>
<body>
<sec sec-type="intro">
<label>1.</label>
<title>Introduction</title>
<p>Advances in digital signal processing have permeated many applications, providing unprecedented growth in capabilities. Complex sensor systems are computationally extremely expensive and the majority of them are not suitable for a real-time implementation [<xref ref-type="bibr" rid="b1-sensors-12-02539">1</xref>–<xref ref-type="bibr" rid="b20-sensors-12-02539">20</xref>]. Through the advent of programmable computing, many of these remote sensing (RS) processing algorithms have been implemented on more general-purpose computing platforms while still preserving compute-intensive functions in dedicated hardware.</p>
<p>Moreover, a mix of dedicated hardware solutions and programmable devices is found in applications for which no other approach can meet the real-time performance demands. Additionally, many hyperspectral imaging applications require a real-time response in areas such as environmental modeling and assessment, target detection for military and homeland defense/security purposes, and risk prevention and response. Even though newer microprocessors can operate at several GHz, providing a maximum throughput in the gigaflops class, several contemporary applications such as space systems, airborne systems, missile seekers, tracking wildfires, detecting biological threats, and monitoring oil spills, to name a few, must rely on a combination of dedicated hardware and programmable embedded systems. As a result of the growth in the requisite parallel data processing, high-performance embedded computing (HPEC) architectures represent a viable solution for achieving high performance in real-time implementations, especially in those applications that include complex RS reconstructive operations.</p>
<p>The main contribution of this paper is the design of an efficient high-speed dual super-systolic array (SSA) core for reconstructive signal processing (SP) operations of RS algorithms via the use of HPEC techniques. With the design of a bit-level dual SSA core architecture, the computational load of complex RS algorithms can be drastically reduced, in pursuit of the real-time operational mode required by newer geospatial applications.</p>
<p>The strategy for implementing a bit-level dual super-systolic architecture in a Xilinx Virtex field programmable gate array (FPGA) platform is aimed at enhancing data locality. To this end, HPEC techniques are used to represent loop programs precisely and to compute complex sequences of loop transformations (interchange, skewing, tiling, <italic>etc</italic>.) that preserve the original program semantics, and the transformed algorithmic representation is then mapped onto SSAs. Likewise, this transformation performs the hardware projections in a bit-level parallel fashion. It is important to remark that other SSA architectures can be implemented by combining different hardware scheduling and allocation functions.</p>
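<p>As a minimal illustration of why such loop transformations preserve program semantics, the following Python sketch applies loop interchange and tiling to a plain matrix-vector product and checks that all variants return the same result. The function names, the tile size, and the toy data are assumptions introduced here for illustration only; the sketch is not the transformation flow or the mapping procedure used for the proposed SSAs.</p>
<preformat>
```python
def matvec_reference(F, u):
    """Reference loop nest: y[i] accumulates F[i][j] * u[j] over j."""
    rows, cols = len(F), len(u)
    y = [0] * rows
    for i in range(rows):
        for j in range(cols):
            y[i] += F[i][j] * u[j]
    return y


def matvec_interchanged(F, u):
    """Same iteration space with the i and j loops interchanged."""
    rows, cols = len(F), len(u)
    y = [0] * rows
    for j in range(cols):
        for i in range(rows):
            y[i] += F[i][j] * u[j]
    return y


def matvec_tiled(F, u, tile=2):
    """Same iteration space regrouped into tile-by-tile blocks."""
    rows, cols = len(F), len(u)
    y = [0] * rows
    for i0 in range(0, rows, tile):
        for j0 in range(0, cols, tile):
            for i in range(i0, min(i0 + tile, rows)):
                for j in range(j0, min(j0 + tile, cols)):
                    y[i] += F[i][j] * u[j]
    return y


# Toy data (assumption): a 3-by-3 integer matrix and a 3-element vector.
F = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
u = [1, 0, -1]

reference = matvec_reference(F, u)
same_after_interchange = matvec_interchanged(F, u) == reference
same_after_tiling = matvec_tiled(F, u, tile=2) == reference
```
</preformat>
<p>In this sketch each output element accumulates its terms in the same order in all three variants, so the results agree exactly; tiling merely regroups the iterations into blocks that could, in principle, be assigned to separate processing elements.</p>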
<p>Although there are some recently developed studies related to the implementation of RS applications [<xref ref-type="bibr" rid="b6-sensors-12-02539">6</xref>–<xref ref-type="bibr" rid="b20-sensors-12-02539">20</xref>], some implementation issues remain unresolved regarding the efficient hardware-level implementation of multi-processor system-on-chip (MPSoC) architectures. Sequential implementations of these RS algorithms are traditionally carried out in synthetic aperture radar (SAR) simulation scenarios with model uncertainties, as previously developed in [<xref ref-type="bibr" rid="b6-sensors-12-02539">6</xref>,<xref ref-type="bibr" rid="b7-sensors-12-02539">7</xref>]. Moreover, the novel parallel computing techniques applied in the design of the proposed SSAs are intended to provide the maximum possible parallelism and better performance than the software simulations and traditional HW-level architectures reported in other studies [<xref ref-type="bibr" rid="b8-sensors-12-02539">8</xref>–<xref ref-type="bibr" rid="b11-sensors-12-02539">11</xref>,<xref ref-type="bibr" rid="b16-sensors-12-02539">16</xref>–<xref ref-type="bibr" rid="b19-sensors-12-02539">19</xref>,<xref ref-type="bibr" rid="b21-sensors-12-02539">21</xref>]. For example, a HW/SW co-design method was developed in [<xref ref-type="bibr" rid="b17-sensors-12-02539">17</xref>] in order to achieve the near real-time implementation of the convex regularization-based procedures for reconstructive signal processing operations. However, in that previous study, no SSA architectures were considered. In [<xref ref-type="bibr" rid="b16-sensors-12-02539">16</xref>], an approach incorporating SA-based implementation schemes is proposed, but the architecture is restricted to matched space filtering based on the triple matrix multiplication and the 1-D convolution. In [<xref ref-type="bibr" rid="b21-sensors-12-02539">21</xref>], a bit-level high-speed VLSI architecture of a matrix-vector structure using multiple arrays of processors in aggregation with the HW/SW co-design is presented. In that study, only the main concepts of a SSA architecture are described and a simple bit-level matrix-vector structure is presented without detail.</p>
<p>Finally, Section 4.2 of this study presents a test case showing how the dual SSA core drastically reduces the processing time of a high-resolution regularization technique for the enhancement/reconstruction of real-world SAR images. Such hardware implementation results illustrate the usefulness of the approach for the system-level optimization of high-resolution image enhancement tasks performed with real-world RS imagery. The authors believe that the proposed high-speed dual super-systolic core architecture is <italic>unique</italic> and <italic>differs completely</italic> from the approaches of the recently developed studies discussed above. Additionally, with the bit-level dual-core architecture based on SSAs, a new paradigm for the design of specialized HPEC hardware modules is introduced.</p></sec>
<sec sec-type="methods">
<label>2.</label>
<title>Design Flow</title>
<p>The term remote sensing (RS) is used to describe the science of identifying, observing, and measuring an object without coming into direct contact with it. This process involves the detection and measurement of radiation of different wavelengths, reflected or emitted from distant objects or materials, by which they may be identified and categorized by class, type, substance, and spatial distribution. RS systems are thus made of sensors mounted on an aircraft or a spacecraft that gather information from the Earth’s surface. Synthetic Aperture Radar (SAR) is an array of active sensors, and it is widely used in remote sensing missions to achieve high-resolution Earth images. In recent years, several efforts have been directed towards the incorporation of high-performance computing (HPC) models into remote sensing missions. Moreover, advances in sensor technology are revolutionizing the way remotely sensed data are collected, managed, and analyzed. In particular, many current and future applications of remote sensing in earth science, space science, and soon in exploration science require real- or near-real-time processing capabilities. <xref ref-type="fig" rid="f1-sensors-12-02539">Figure 1</xref> illustrates a multi-sensor image acquisition and reconstructive processing system based on a MPSoC platform for the enhancement/reconstruction of RS imagery via the HW/SW co-design paradigm.</p>
<p>In this study, we also propose a design methodology for the real-time implementation of specialized arrays of processors in a high performance embedded computing (HPEC) architecture. This architecture is based on a dual super-systolic array core acting as a coprocessor unit that is integrated in a MPSoC platform via a HW/SW co-design paradigm.</p>
<p>This approach represents a real possibility for high-speed reconstructive signal processing (SP) tasks for the enhancement/reconstruction of RS imagery. In addition, the authors believe that the FPGA/DSP-based systems in aggregation with novel bit-level super-systolic architectures are emerging as newer solutions which offer enormous computation potential in RS systems.</p>
<sec sec-type="methods">
<label>2.1.</label>
<title>HW/SW Co-Design Methodology</title>
<p>In this sub-section, we describe the HW/SW co-design methodology implemented in this study. The HW/SW co-design is a hybrid method aimed at increasing the flexibility of the implementation and improving the overall design process [<xref ref-type="bibr" rid="b16-sensors-12-02539">16</xref>–<xref ref-type="bibr" rid="b20-sensors-12-02539">20</xref>]. When a co-processor-based solution is employed in the HW/SW co-design architecture, the computational time can be drastically reduced. Two opposite alternatives can be considered when exploring the HW/SW co-design of a complex SP system. One of them is the use of standard components whose functionality can be defined by means of programming. The other one is the implementation of this functionality via a microelectronic circuit specifically tailored for that application. It is well known that the first alternative (the software alternative) provides solutions that offer great flexibility at the cost of high area requirements and long execution times, while the second one (the hardware alternative) optimizes size and operation speed but limits the flexibility of the solution. Halfway between both, hardware/software co-design techniques try to obtain an appropriate trade-off between the advantages and drawbacks of these two approaches.</p>
<p>The HW/SW co-design methodology encompasses the following general stages:
<list list-type="roman-lower">
<list-item>
<p>Algorithmic implementation (reference simulation in MATLAB and C++ platforms);</p></list-item>
<list-item>
<p>Computational partitioning process;</p></list-item>
<list-item>
<p>Architecture design procedure using HPEC techniques.</p></list-item></list></p>
<p>From the analysis of the HW/SW co-design methodology, one can deduce that the RS algorithm is first adapted in a co-design scheme applying HPEC techniques, and then, the selected computationally complex reconstructive operations are efficiently implemented in bit-level high-throughput accelerator architectures.</p>
<sec sec-type="methods">
<label>2.1.1.</label>
<title>Algorithmic Implementation Analysis</title>
<p>In this sub-section, the procedure for the computational implementation of the RS-related regularization algorithms using MATLAB and C++ platforms is developed. With these algorithmic analyses, the effectiveness of the model employed in the HW/SW co-design is verified.</p>
<p>All the numerical test sequences are generated with the Fixed Point Toolbox [<xref ref-type="bibr" rid="b22-sensors-12-02539">22</xref>] of MATLAB in order to verify computationally the proposed HW/SW co-design methodology (<italic>i.e.</italic>, test sequences for performing the SW simulation and for the HW verifications). Also, the mean square error (MSE) test is implemented to verify the correct fixed-point implementation (<italic>i.e.</italic>, for signed numbers in two’s complement format). In the case of the C++ platform, the analysis is performed in order to evaluate the real-time performance. The results of such SW simulation and HW performance analysis will be presented and discussed further on in Sections 4.1 and 4.2.</p>
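<p>The fixed-point check described above can be sketched in a few lines of Python. The sketch assumes the signed two’s complement format with 23 fractional bits that is specified in Section 2.1.3, and further assumes that the sign bit is counted among the 9 integer bits (a 32-bit word) and that out-of-range values saturate. The helper names and the test sequence are hypothetical, and the sketch does not reproduce the MATLAB Fixed Point Toolbox.</p>
<preformat>
```python
FRAC_BITS = 23  # fractional bits (Section 2.1.3)
INT_BITS = 9    # integer bits; assumed here to include the sign bit
SCALE = 2 ** FRAC_BITS
RAW_MAX = 2 ** (INT_BITS + FRAC_BITS - 1) - 1
RAW_MIN = -(2 ** (INT_BITS + FRAC_BITS - 1))


def to_fixed(x):
    """Quantize a float to a saturated two's complement raw integer."""
    raw = round(x * SCALE)  # nearest integer; exact ties go to the even one
    return max(RAW_MIN, min(RAW_MAX, raw))


def to_float(raw):
    """Convert a raw fixed-point integer back to a float."""
    return raw / SCALE


def mean_square_error(reference, approx):
    """MSE between a reference sequence and its approximation."""
    return sum((r - a) ** 2 for r, a in zip(reference, approx)) / len(reference)


# Hypothetical test sequence: 21 samples between -1 and 1.
samples = [0.1 * k - 1.0 for k in range(21)]
quantized = [to_float(to_fixed(x)) for x in samples]

mse = mean_square_error(samples, quantized)
worst_error = max(abs(x - q) for x, q in zip(samples, quantized))
```
</preformat>
<p>With rounding to nearest, the representation error of a single in-range value cannot exceed half of the quantization step, <italic>i.e.</italic>, 2<sup>−24</sup> (about 6 × 10<sup>−8</sup>) for 23 fractional bits. Errors accumulated over a sequence of arithmetic operations can be larger and are not covered by this sketch.</p>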
<p>Now, we briefly describe a family of previously developed nonparametric high-resolution RS imaging techniques [<xref ref-type="bibr" rid="b12-sensors-12-02539">12</xref>,<xref ref-type="bibr" rid="b13-sensors-12-02539">13</xref>,<xref ref-type="bibr" rid="b15-sensors-12-02539">15</xref>–<xref ref-type="bibr" rid="b20-sensors-12-02539">20</xref>], via the generalization of their regularization optimization formalism. Such techniques incorporate different regularization and computation paradigms that enable one to modify some controllable algorithmic-level “degrees of freedom” as well as design a variety of efficient aggregated/fused data/image processing methods.</p>
<p>Examples of different RS imaging methods are the following: the Constrained Least Squares (CLS) and the Weighted CLS, which are deterministic methods that incorporate partial error functions into the corresponding objective costs [<xref ref-type="bibr" rid="b20-sensors-12-02539">20</xref>]. In [<xref ref-type="bibr" rid="b12-sensors-12-02539">12</xref>,<xref ref-type="bibr" rid="b13-sensors-12-02539">13</xref>], the unified descriptive experiment design regularization (DEDR) paradigm incorporates other robust and more sophisticated statistical methods into the unified optimization problem, among them the rough conventional matched spatial filtering (MSF) approach [<xref ref-type="bibr" rid="b3-sensors-12-02539">3</xref>], the descriptive maximum entropy (ME) technique [<xref ref-type="bibr" rid="b20-sensors-12-02539">20</xref>], the robust spatial filtering (RSF) method [<xref ref-type="bibr" rid="b12-sensors-12-02539">12</xref>], the robust adaptive spatial filtering (RASF) technique [<xref ref-type="bibr" rid="b13-sensors-12-02539">13</xref>], and the fused Bayesian-DEDR regularization (FBR) method [<xref ref-type="bibr" rid="b20-sensors-12-02539">20</xref>]. All such DEDR optimization procedures have been detailed in previous studies [<xref ref-type="bibr" rid="b12-sensors-12-02539">12</xref>,<xref ref-type="bibr" rid="b13-sensors-12-02539">13</xref>,<xref ref-type="bibr" rid="b16-sensors-12-02539">16</xref>–<xref ref-type="bibr" rid="b20-sensors-12-02539">20</xref>]. It is important to remark that, due to the non-linearity of the objective functions, the solution of the parametrically controlled fusion-optimization problem requires extremely complex nonparametric algorithms [<xref ref-type="bibr" rid="b20-sensors-12-02539">20</xref>] and results in technically intractable computational schemes if these problems are solved using standard simulation software and hardware platforms based on DSPs and networks of CPUs.</p>
<p>The above implementation schemes are optimized in order to solve RS imaging problems, stated as follows: the scene pixel-frame image <bold>B̂</bold> is estimated via lexicographical reordering <bold>B̂</bold> = <italic>L</italic>{<bold>b̂</bold>} of the spatial spectrum pattern (SSP) vector <bold>b̂</bold> reconstructed from whatever available measurements of independent realizations {<bold>u</bold><italic><sub>(j)</sub></italic>; <italic>j</italic> = 1, …, <italic>J</italic>} of the recorded data vector. Thus, one can seek to find, <bold>b̂</bold>, as a discrete-form representation of the desired SSP, given the data correlation matrix <bold>R<sub>u</sub></bold> = <bold>Y</bold> pre-estimated empirically via averaging <italic>J</italic> ≥ 1 recorded data vector snapshots {<bold>u</bold><italic><sub>(j)</sub></italic>} [<xref ref-type="bibr" rid="b12-sensors-12-02539">12</xref>]; and by determining the solution operator that we also refer to as the signal formation operator (SFO) <bold>F</bold> such that:
<disp-formula id="FD1">
<label>(1)</label>
<mml:math display="block">
<mml:mtable columnalign="left">
<mml:mtr>
<mml:mtd>
<mml:msup>
<mml:mrow>
<mml:mover accent="true">
<mml:mi mathvariant="bold">b</mml:mi>
<mml:mo mathvariant="bold">^</mml:mo></mml:mover></mml:mrow>
<mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:mi>p</mml:mi>
<mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msup>
<mml:mo>=</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mo stretchy="false">{</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mover accent="true">
<mml:mi mathvariant="bold">R</mml:mi>
<mml:mo mathvariant="bold">^</mml:mo></mml:mover></mml:mrow>
<mml:mi mathvariant="normal">e</mml:mi></mml:msub>
<mml:mo stretchy="false">}</mml:mo></mml:mrow>
<mml:mrow>
<mml:mtext>diag</mml:mtext></mml:mrow></mml:msub>
<mml:mo>=</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mo stretchy="false">{</mml:mo>
<mml:msup>
<mml:mrow>
<mml:mi mathvariant="bold">F</mml:mi></mml:mrow>
<mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:mi>p</mml:mi>
<mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msup>
<mml:msup>
<mml:mrow>
<mml:mi mathvariant="bold">YF</mml:mi></mml:mrow>
<mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:mi>p</mml:mi>
<mml:mo stretchy="false">)</mml:mo>
<mml:mo>+</mml:mo></mml:mrow></mml:msup>
<mml:mo stretchy="false">}</mml:mo></mml:mrow>
<mml:mrow>
<mml:mtext>diag</mml:mtext></mml:mrow></mml:msub>
<mml:mo>=</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:msup>
<mml:mrow>
<mml:mi mathvariant="bold">F</mml:mi></mml:mrow>
<mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:mi>p</mml:mi>
<mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msup>
<mml:msup>
<mml:mrow>
<mml:mi mathvariant="bold">uu</mml:mi></mml:mrow>
<mml:mo>+</mml:mo></mml:msup>
<mml:msup>
<mml:mrow>
<mml:mi mathvariant="bold">F</mml:mi></mml:mrow>
<mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:mi>p</mml:mi>
<mml:mo stretchy="false">)</mml:mo>
<mml:mo>+</mml:mo></mml:mrow></mml:msup>
<mml:mo stretchy="false">)</mml:mo></mml:mrow>
<mml:mrow>
<mml:mtext>diag</mml:mtext></mml:mrow></mml:msub></mml:mtd></mml:mtr>
<mml:mtr>
<mml:mtd>
<mml:mo>=</mml:mo>
<mml:mo stretchy="false">(</mml:mo>
<mml:msup>
<mml:mrow>
<mml:mi mathvariant="bold">F</mml:mi></mml:mrow>
<mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:mi>p</mml:mi>
<mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msup>
<mml:mi mathvariant="bold">u</mml:mi>
<mml:mo stretchy="false">)</mml:mo>
<mml:mo stretchy="false">⊙</mml:mo>
<mml:mo> </mml:mo>
<mml:msup>
<mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:msup>
<mml:mrow>
<mml:mi mathvariant="bold">F</mml:mi></mml:mrow>
<mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:mi>p</mml:mi>
<mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:msup>
<mml:mi mathvariant="bold">u</mml:mi>
<mml:mo stretchy="false">)</mml:mo></mml:mrow>
<mml:mo>+</mml:mo></mml:msup>
<mml:mo>;</mml:mo>
<mml:mo> </mml:mo>
<mml:mi>p</mml:mi>
<mml:mo>=</mml:mo>
<mml:mn>1</mml:mn>
<mml:mo>,</mml:mo>
<mml:mo>...</mml:mo>
<mml:mo>,</mml:mo>
<mml:mi>P</mml:mi></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>where {·}<sub>diag</sub> defines the vector composed of the principal diagonal of the embraced matrix, ⊙ defines the Schur-Hadamard (element by element) product, and <bold>F</bold><sup>(</sup><italic><sup>p</sup></italic><sup>)</sup> represents the SFO of the <italic>p</italic>-th reconstructive/enhancement regularization technique.</p>
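<p>The chain of equalities in <xref ref-type="disp-formula" rid="FD1">Equation (1)</xref> can be checked numerically with a small sketch. The Python code below computes the estimate in two ways: from the element-by-element product of <bold>Fu</bold> with its conjugate, averaged over the snapshots, and from the principal diagonal of <bold>FYF</bold><sup>+</sup> with <bold>Y</bold> formed as the average of the outer products of the snapshots. The toy operator, the toy snapshots, and all function names are assumptions made for illustration; they are not taken from the hardware design.</p>
<preformat>
```python
def matvec(F, u):
    """Matrix-vector product F u for a matrix stored as a list of rows."""
    return [sum(f * x for f, x in zip(row, u)) for row in F]


def ssp_from_snapshots(F, snapshots):
    """Average over snapshots of (F u) times its conjugate, element by element."""
    b = [0.0] * len(F)
    for u in snapshots:
        v = matvec(F, u)
        for k, vk in enumerate(v):
            b[k] += (vk * vk.conjugate()).real
    return [x / len(snapshots) for x in b]


def correlation_matrix(snapshots):
    """Empirical Y: average of the outer products u u^+ over the J snapshots."""
    J, M = len(snapshots), len(snapshots[0])
    return [[sum(u[m] * u[n].conjugate() for u in snapshots) / J
             for n in range(M)] for m in range(M)]


def ssp_from_correlation(F, Y):
    """Principal diagonal of F Y F^+, computed without forming the full product."""
    M = len(Y)
    return [sum(row[m] * Y[m][n] * row[n].conjugate()
                for m in range(M) for n in range(M)).real
            for row in F]


# Toy operator (2 scene pixels, 3 data samples) and J = 2 snapshots: assumptions.
F = [[1 + 1j, 2, -1j], [0.5, 1j, 1 - 1j]]
snapshots = [[1, 1j, 2], [1 - 1j, 0, 1j]]

b_direct = ssp_from_snapshots(F, snapshots)
b_via_Y = ssp_from_correlation(F, correlation_matrix(snapshots))
```
</preformat>
<p>The first route needs only matrix-vector products followed by an element-by-element multiplication, whereas the second one forms the full correlation matrix first; both routes return the same estimate for this toy case.</p>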
<p>To optimize the search of the SFO <bold>F</bold>, the following <italic>DEDR</italic> strategy [<xref ref-type="bibr" rid="b12-sensors-12-02539">12</xref>] is formulated:
<disp-formula id="FD2">
<label>(2)</label>
<mml:math display="block">
<mml:mrow>
<mml:mi mathvariant="bold">F</mml:mi>
<mml:mo>→</mml:mo>
<mml:munder>
<mml:mrow>
<mml:mtext>min</mml:mtext></mml:mrow>
<mml:mi mathvariant="bold">F</mml:mi></mml:munder>
<mml:mo> </mml:mo>
<mml:mo stretchy="false">{</mml:mo>
<mml:mi>ℜ</mml:mi>
<mml:mo stretchy="false">(</mml:mo>
<mml:mi mathvariant="bold">F</mml:mi>
<mml:mo stretchy="false">)</mml:mo>
<mml:mo stretchy="false">}</mml:mo></mml:mrow></mml:math></disp-formula>where:
<disp-formula id="FD3">
<label>(3)</label>
<mml:math display="block">
<mml:mrow>
<mml:mi>ℜ</mml:mi>
<mml:mo stretchy="false">(</mml:mo>
<mml:mi mathvariant="bold">F</mml:mi>
<mml:mo stretchy="false">)</mml:mo>
<mml:mo>=</mml:mo>
<mml:mtext>trace</mml:mtext>
<mml:mrow>
<mml:mo>{</mml:mo>
<mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:mi mathvariant="bold">FS</mml:mi>
<mml:mo>−</mml:mo>
<mml:mi mathvariant="bold">I</mml:mi>
<mml:mo stretchy="false">)</mml:mo>
<mml:mi mathvariant="bold">A</mml:mi>
<mml:msup>
<mml:mrow>
<mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:mi mathvariant="bold">FS</mml:mi>
<mml:mo>−</mml:mo>
<mml:mi mathvariant="bold">I</mml:mi>
<mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mrow>
<mml:mo>+</mml:mo></mml:msup></mml:mrow>
<mml:mo>}</mml:mo></mml:mrow>
<mml:mo>+</mml:mo>
<mml:mi>α</mml:mi>
<mml:mo> </mml:mo>
<mml:mtext>trace</mml:mtext>
<mml:mrow>
<mml:mo>{</mml:mo>
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="bold">FR</mml:mi></mml:mrow>
<mml:mi mathvariant="bold">n</mml:mi></mml:msub>
<mml:msup>
<mml:mrow>
<mml:mi mathvariant="bold">F</mml:mi></mml:mrow>
<mml:mo>+</mml:mo></mml:msup></mml:mrow>
<mml:mo>}</mml:mo></mml:mrow></mml:mrow></mml:math></disp-formula>implies the minimization of the weighted sum of the systematic and fluctuation errors in the desired estimate <bold>b̂</bold>, where the selection (adjustment) of the regularization parameter <italic>α</italic> and the weight matrix <bold>A</bold> provides additional experiment design degrees of freedom that incorporate any descriptive properties of a solution if those are known a priori [<xref ref-type="bibr" rid="b12-sensors-12-02539">12</xref>,<xref ref-type="bibr" rid="b20-sensors-12-02539">20</xref>]. For more detailed information on how to optimize the SFO <bold>F</bold>, see references [<xref ref-type="bibr" rid="b12-sensors-12-02539">12</xref>,<xref ref-type="bibr" rid="b13-sensors-12-02539">13</xref>].</p>
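<p>To make the trade-off in <xref ref-type="disp-formula" rid="FD3">Equation (3)</xref> concrete, the following Python sketch evaluates the objective for a few candidate operators. The symbols <bold>S</bold>, <bold>A</bold>, and <bold>R</bold><sub>n</sub> are used exactly as they appear in the equation; setting all of them to 2-by-2 identity matrices, restricting <bold>F</bold> to scaled identities, and choosing <italic>α</italic> = 1 are assumptions made only to keep the example small. The sketch evaluates the objective and does not implement the DEDR optimization procedure itself.</p>
<preformat>
```python
def matmul(X, Y):
    """Plain matrix product of two matrices stored as lists of rows."""
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]


def adjoint(X):
    """Conjugate transpose, written as the superscript + in Equation (3)."""
    return [[X[i][j].conjugate() for i in range(len(X))]
            for j in range(len(X[0]))]


def trace(X):
    """Sum of the principal diagonal of a square matrix."""
    return sum(X[i][i] for i in range(len(X)))


def dedr_objective(F, S, A, Rn, alpha):
    """trace{(FS - I) A (FS - I)^+} + alpha * trace{F Rn F^+}."""
    FS = matmul(F, S)
    D = [[FS[i][j] - (1 if i == j else 0) for j in range(len(FS))]
         for i in range(len(FS))]
    systematic = trace(matmul(matmul(D, A), adjoint(D)))
    fluctuation = trace(matmul(matmul(F, Rn), adjoint(F)))
    return (systematic + alpha * fluctuation).real


def scaled_identity(c):
    """2-by-2 matrix c * I used as a toy candidate operator."""
    return [[c, 0.0], [0.0, c]]


# Toy case (assumption): S, A and Rn are all 2-by-2 identity matrices.
I2 = scaled_identity(1.0)
costs = {c: dedr_objective(scaled_identity(c), I2, I2, I2, alpha=1.0)
         for c in (0.0, 0.5, 1.0)}
```
</preformat>
<p>For this toy case the objective reduces to 2(<italic>c</italic> − 1)<sup>2</sup> + 2<italic>αc</italic><sup>2</sup> for <bold>F</bold> = <italic>c</italic><bold>I</bold>, so with <italic>α</italic> = 1 the intermediate choice <italic>c</italic> = 0.5 (cost 1) is preferable to both <italic>c</italic> = 0 and <italic>c</italic> = 1 (cost 2 each), which shows how <italic>α</italic> balances the systematic error term against the fluctuation error term.</p>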
<p>Having established the optimal reconstructive RS estimators, let us now consider the way in which the processing of the data vector <bold>u</bold>, which results in the optimum estimate <bold>b̂</bold>, can be performed computationally. For this purpose, we refer to the estimator (1) as a multi-stage computational procedure. Also, from the algorithmic analysis, we outline the following important remarks regarding a possible hardware level architecture accelerator for complex reconstructive computational tasks required for implementing different RS imaging methods.
<list list-type="roman-lower">
<list-item>
<p>First, the point spread matrix (PSM) [<xref ref-type="bibr" rid="b12-sensors-12-02539">12</xref>] operations of the SFO over the azimuth and range axes can be calculated concurrently.</p></list-item>
<list-item>
<p>Second, the Schur-Hadamard operation and the parallel reconstructive/enhancement operations of <bold>F</bold><sup>(<italic>p</italic>)</sup> <bold>u</bold> can be implemented in a dual-core architecture. Notice that in <xref ref-type="disp-formula" rid="FD1">Equation (1)</xref>, the complex signal processing operations were algorithmically adapted for their efficient implementation.</p></list-item></list></p></sec>
<sec>
<label>2.1.2.</label>
<title>Computational Partitioning Process</title>
<p>This subsection describes how to perform an efficient HW/SW partitioning of the computational tasks. The aim of the partitioning problem is to determine which computational tasks can be implemented in an efficient hardware architecture, seeking the best trade-offs among the different solutions [<xref ref-type="bibr" rid="b23-sensors-12-02539">23</xref>,<xref ref-type="bibr" rid="b24-sensors-12-02539">24</xref>]. Solving the problem first requires the definition of a partitioning model that meets all the specification requirements (<italic>i.e</italic>., functionality, goals and constraints).</p>
<p>The proposed partitioning stage is clearly influenced by the target architecture onto which the HW and the SW will be mapped. We begin by specifying the system-level partitioning functions and detailing the selected design quality attributes for the HW/SW co-design, aimed at defining the computational tasks that can be implemented in the dual super-systolic core form, namely: hardware area (<italic>ha</italic>), hardware execution time (<italic>ht</italic>), software execution time (<italic>St</italic>), and the selected system resolution (<italic>n</italic>); here <italic>maxha</italic>, <italic>maxht</italic> and <italic>maxSt</italic> represent the upper bounds of the first three attributes. In particular, for implementing the fixed-point RS estimator operations of <xref ref-type="disp-formula" rid="FD1">Equation (1)</xref>, the partitioning process must satisfy the following performance requirements [<xref ref-type="bibr" rid="b25-sensors-12-02539">25</xref>].</p>
<list list-type="roman-lower">
<list-item>
<p>The system must always satisfy the constraints: 0 ≤ <italic>ha &lt; maxha</italic>, 0 ≤ <italic>ht &lt; maxht</italic>, for each <italic>i</italic>th hardware accelerator <italic>Ac<sub>i</sub>, i</italic> = 1,…,<italic>l</italic>; and 0 ≤ <italic>St &lt; maxSt</italic>, for the DSP/embedded processor <italic>E</italic>. These parallel hardware accelerators {<italic>Ac<sub>i</sub></italic>} and the DSP/embedded processor compose the target architecture <italic>Target</italic> = {<italic>E</italic>, <italic>Ac<sub>i</sub></italic>, <italic>n</italic> }, for the pre-selected FPGA with the corresponding predetermined architecture constraints <bold><italic>C</italic></bold>: {0 ≤ <italic>ha &lt; maxha</italic>; 0 ≤ <italic>ht &lt; maxht</italic>; 0 ≤ <italic>St &lt; maxSt</italic>}.</p></list-item>
<list-item>
<p>Each block implementation {<italic>Bl</italic>(<italic>Ac<sub>i</sub></italic>), <italic>Bl</italic>(<italic>E</italic>)} must satisfy the predefined execution time performance requirements: <italic>τ</italic>{<italic>Bl</italic>(<italic>Ac<sub>i</sub></italic>|<bold>C</bold><italic><sub>i</sub></italic>); <italic>i</italic> = 1,…,<italic>l</italic>} and <italic>τ</italic>{<italic>Bl</italic>(<italic>E</italic>|<bold>C</bold><italic><sub>E</sub></italic>)}, conditioned by the architecture constraints specified above, {<bold>C</bold><italic><sub>i</sub></italic>: {0 ≤ <italic>ht<sub>i</sub></italic> &lt; <italic>maxht<sub>i</sub></italic>; 0 ≤ <italic>ha<sub>i</sub></italic> &lt; <italic>maxha<sub>i</sub></italic>}; <italic>∀ i</italic> = 1,…, <italic>l</italic>} and <bold>C</bold><italic><sub>E</sub></italic>: 0 ≤ <italic>St &lt; maxSt</italic>, correspondingly.</p></list-item></list>
<p>Now, the HW/SW co-design system architecture is to be optimized by bounding the total expected system processing time <italic>τ</italic> = <italic>τ</italic>{<italic>Bl</italic>(<italic>Ac<sub>i</sub></italic>|<bold>C</bold><italic><sub>i</sub></italic>)}, evaluated via:
<disp-formula id="FD4">
<label>(4)</label>
<mml:math display="block">
<mml:mrow>
<mml:mi>τ</mml:mi>
<mml:mo stretchy="false">{</mml:mo>
<mml:mi mathvariant="italic">Bl</mml:mi>
<mml:mo stretchy="false">(</mml:mo>
<mml:mi mathvariant="italic">Ac</mml:mi>
<mml:mo stretchy="false">|</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="bold">C</mml:mi></mml:mrow>
<mml:mi>i</mml:mi></mml:msub>
<mml:mo stretchy="false">)</mml:mo>
<mml:mo stretchy="false">}</mml:mo>
<mml:mo>=</mml:mo>
<mml:mo stretchy="false">(</mml:mo>
<mml:munder>
<mml:mrow>
<mml:mtext>max</mml:mtext></mml:mrow>
<mml:mi>i</mml:mi></mml:munder>
<mml:mo stretchy="false">{</mml:mo>
<mml:mi>τ</mml:mi>
<mml:mo stretchy="false">{</mml:mo>
<mml:mi mathvariant="italic">Bl</mml:mi>
<mml:mo stretchy="false">(</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">Ac</mml:mi></mml:mrow>
<mml:mi>i</mml:mi></mml:msub>
<mml:mo stretchy="false">|</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="bold">C</mml:mi></mml:mrow>
<mml:mi mathvariant="normal">i</mml:mi></mml:msub>
<mml:mo stretchy="false">)</mml:mo>
<mml:mo stretchy="false">}</mml:mo>
<mml:mo stretchy="false">}</mml:mo>
<mml:mo>+</mml:mo>
<mml:mi>τ</mml:mi>
<mml:mo stretchy="false">{</mml:mo>
<mml:mi mathvariant="italic">Bl</mml:mi>
<mml:mo stretchy="false">(</mml:mo>
<mml:mi>E</mml:mi>
<mml:mo stretchy="false">|</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="bold">C</mml:mi></mml:mrow>
<mml:mi>E</mml:mi></mml:msub>
<mml:mo stretchy="false">)</mml:mo>
<mml:mo stretchy="false">}</mml:mo>
<mml:mo stretchy="false">)</mml:mo>
<mml:mo>&lt;</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="normal">T</mml:mi></mml:mrow>
<mml:mrow>
<mml:mi mathvariant="normal">C</mml:mi>
<mml:mo>+</mml:mo>
<mml:mo>+</mml:mo></mml:mrow></mml:msub></mml:mrow></mml:math></disp-formula>where T<sub>C++</sub> represents the execution time required to implement the corresponding RS-related regularization algorithms in the standard C++ computational environment.</p>
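<p>As a minimal illustration of the bound in <xref ref-type="disp-formula" rid="FD4">Equation (4)</xref>, the following Python sketch evaluates the total processing time as the slowest accelerator block plus the embedded-processor block, and compares it with a reference time T<sub>C++</sub>. The function names and all timing values are hypothetical and serve only to show how the bound is read; they are not measurements from this study.</p>
<preformat>
```python
from typing import Sequence


def total_processing_time(tau_acc: Sequence[float], tau_e: float) -> float:
    """Middle term of Equation (4): max_i tau{Bl(Ac_i|C_i)} + tau{Bl(E|C_E)}."""
    if not tau_acc:
        raise ValueError("at least one accelerator block is required")
    return max(tau_acc) + tau_e


def meets_bound(tau_acc: Sequence[float], tau_e: float, t_cpp: float) -> bool:
    """True when the co-design time is strictly below the C++ reference time."""
    return t_cpp > total_processing_time(tau_acc, tau_e)


# Hypothetical timings in seconds (illustrative only, not measured values).
tau_acc_example = [0.75, 1.5, 1.25]
tau_e_example = 0.5
t_cpp_example = 20.0

tau_total = total_processing_time(tau_acc_example, tau_e_example)
print(tau_total, meets_bound(tau_acc_example, tau_e_example, t_cpp_example))
```
</preformat>
<p>In this reading, only the slowest accelerator block enters the first term, so shortening any of the other blocks does not tighten the bound.</p>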
<p>Note that, from the formal SW-level co-design point of view, the RS-regularized techniques of <xref ref-type="disp-formula" rid="FD1">Equations (1</xref>–<xref ref-type="disp-formula" rid="FD3">3)</xref> can be considered as a properly ordered sequence of reconstructive signal processing operations, which can then be performed in a computationally efficient fashion using the HW/SW co-design paradigm proposed above.</p></sec>
<sec sec-type="methods">
<label>2.1.3.</label>
<title>Architecture Design Procedure Using HPEC Techniques</title>
<p>Following the partitioning paradigm presented above, one can now decompose the fixed-point RS-regularized algorithms developed at the SW-design stage into the DSP/embedded processor and the specialized high-speed hardware accelerators <italic>Ac<sub>i</sub>, i</italic> = 1, …, <italic>l</italic>. In this study, the proposed bit-level dual super-systolic core is aggregated with a DSP/embedded processor via the HW/SW co-design proposed above, as illustrated in <xref ref-type="fig" rid="f1-sensors-12-02539">Figure 1</xref>.</p>
<p>In the design, the SSAs require a high bandwidth for data exchange with the DSP/embedded processor. Another challenging task of the co-design is how to manage large blocks of data while avoiding unnecessary data transfers between the embedded processor and the proposed bit-level HW accelerator.</p>
<p>The main parameters to consider in the partitioning stage are the task execution speed and the area required by its HW-level implementation. Based on these parameters, the HW/SW co-design is carried out, which consists in deciding which tasks should be executed in SW and which ones should be implemented in HW. Additionally, a number of loop optimization techniques used in HPEC (<italic>e.g</italic>., loop unrolling, tiling, loop interchange, <italic>etc</italic>.) are implemented in order to exploit the maximum possible parallelism in the design (see [<xref ref-type="bibr" rid="b2-sensors-12-02539">2</xref>] for more details). Also, the fixed-point software analysis stage is carried out (in this study, a format of 9 integer bits and 23 fractional bits with rounding to nearest is selected for all the fixed-point operations), and the C/C++ reference implementation is realized. Such precision guarantees numerical computational errors of less than 10<sup>−5</sup> with reference to the MATLAB Fixed Point Toolbox [<xref ref-type="bibr" rid="b22-sensors-12-02539">22</xref>]. Note that the acquired RS images are stored on and loaded from a compact flash device, and the resulting enhanced images are stored on the same memory device. Finally, the architecture in the form of a dual SSA core may be implemented on Field Programmable Gate Arrays (FPGAs) or coarse-grained [<xref ref-type="bibr" rid="b26-sensors-12-02539">26</xref>] programmable array architectures.</p></sec></sec></sec>
<sec>
<label>3.</label>
<title>Dual Super-Systolic Array Core</title>
<p>The super-systolic array (SSA) is a generalization of the systolic array (SA). It is a specialized form of architecture in which the cells (<italic>i.e</italic>., processors) compute the data and store it independently of each other. SSAs consist of a network of cells (<italic>i.e</italic>., processing elements (PE)) in which each cell is itself conceptualized as another SA in a bit-level fashion. The SA architectures provide an optimal platform for the efficient HW-level implementation of a number of reconstructive signal processing (SP) algorithms as coprocessor accelerators [<xref ref-type="bibr" rid="b27-sensors-12-02539">27</xref>,<xref ref-type="bibr" rid="b28-sensors-12-02539">28</xref>]. In this study, the implementation of a custom high-speed architecture, <italic>i.e</italic>., the dual SSA core, represents a new paradigm in the design of HPEC architectures that drastically reduces the processing time of the addressed reconstructive SP technique. <xref ref-type="fig" rid="f2-sensors-12-02539">Figure 2</xref> presents a multiprocessor system on chip (MPSoC) platform for implementing the RS enhancement/reconstruction algorithms via the HW/SW co-design paradigm.</p>
<p>The first stage of the SSA-based design flow of <xref ref-type="fig" rid="f2-sensors-12-02539">Figure 2</xref> consists in transforming the nested-loop algorithms of the selected RS-reconstructive operations into a parallel algorithmic representation with local and regular dependencies [<xref ref-type="bibr" rid="b27-sensors-12-02539">27</xref>,<xref ref-type="bibr" rid="b28-sensors-12-02539">28</xref>]. Next, with the tiling technique, the large-scale index space of a real-size RS scene frame is divided into regular tiles (or blocks), and the tiles are then traversed to cover the whole index space [<xref ref-type="bibr" rid="b27-sensors-12-02539">27</xref>–<xref ref-type="bibr" rid="b29-sensors-12-02539">29</xref>]. Finally, the dual SSA core is developed as a co-processor structure. The main challenge of this study is to present a methodology for the development of such a dual SSA core from the addressed reconstructive signal processing operations, and also for the generation of an efficient control system. This is one of the major contributions of this paper, due to the lack of HPEC tools and of control system methodologies.</p>
<p>In <xref ref-type="fig" rid="f3-sensors-12-02539">Figure 3</xref>, the conceptualization of the fixed-point dual SSA core is depicted. From the analysis of <xref ref-type="fig" rid="f3-sensors-12-02539">Figure 3</xref>, one can see the two SSA machines running in parallel, followed by the element-by-element Schur-Hadamard product, which together implement the optimal reconstructive RS estimator of <xref ref-type="disp-formula" rid="FD1">Equation (1)</xref>. This SSA efficiently computes the complex SSP estimation of the RS algorithms. Notice that at this implementation stage, <xref ref-type="fig" rid="f3-sensors-12-02539">Figure 3</xref> describes the HW-level architecture only at a coarse-grained level of detail. Through this figure, one can also deduce how such complex matrix operators work together in order to perform the optimal reconstructive RS estimator.</p>
<p>Both SSA architectures perform the discrete-form representation of the desired spatial spectrum pattern (SSP) in a high-performance structure. The multiply-accumulate (MAC) operation implemented in each processing element (PE) is now depicted in <xref ref-type="fig" rid="f4-sensors-12-02539">Figure 4</xref>.</p>
<p>The internal structure of each PE presented in <xref ref-type="fig" rid="f4-sensors-12-02539">Figure 4</xref> contains a multiplier and an adder. Each PE receives 32-bit operands and generates a 64-bit product. The product is then truncated and rounded to 32 bits using a round-to-nearest scheme, with the adopted fixed-point representation of 9 integer bits and 23 fractional bits. The bit-level SSA representation of this MAC module is presented in Section 3.4.</p>
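<p>The MAC operation described above can be sketched in Python as follows, assuming the layout of 9 integer bits and 23 fractional bits quoted in Section 2.1.3. Several details are assumptions made only for this sketch, since they are not specified in the text: the sign is taken to be part of the 9 integer bits (two's complement), ties are rounded upwards, and an overflow raises an error instead of saturating or wrapping. All names are hypothetical.</p>
<preformat>
```python
FRAC_BITS = 23                        # fractional bits (assumed Q9.23 layout)
WORD_BITS = 32                        # operand width stated in the text
SCALE = 2 ** FRAC_BITS
WORD_MAX = 2 ** (WORD_BITS - 1) - 1   # largest signed 32-bit word
WORD_MIN = -(2 ** (WORD_BITS - 1))    # smallest signed 32-bit word


def to_fixed(x: float) -> int:
    """Quantize a real value to a signed 32-bit word with 23 fractional bits."""
    word = round(x * SCALE)
    if word > WORD_MAX or WORD_MIN > word:
        raise OverflowError("value does not fit the assumed 32-bit format")
    return word


def to_float(word: int) -> float:
    """Convert a fixed-point word back to a real value."""
    return word / SCALE


def mac(acc: int, a: int, b: int) -> int:
    """Return acc + a*b, rounding the full-width product back to 32 bits."""
    product = a * b                    # up to 64 bits, 46 fractional bits
    half_lsb = 2 ** (FRAC_BITS - 1)    # half of the retained least significant bit
    rounded = (product + half_lsb) >> FRAC_BITS   # round to nearest, ties upwards
    result = acc + rounded
    if result > WORD_MAX or WORD_MIN > result:
        raise OverflowError("accumulator overflow (handling is an assumption)")
    return result


# Two chained MAC steps: 0 + 1.5 * (-2.25) + 0.5 * 0.5
acc = to_fixed(0.0)
acc = mac(acc, to_fixed(1.5), to_fixed(-2.25))
acc = mac(acc, to_fixed(0.5), to_fixed(0.5))
print(to_float(acc))
```
</preformat>
<p>The operands in this example are exactly representable in the assumed format, so the chained result carries no rounding error; for general operands, each product contributes an error of at most half of the least significant retained bit.</p>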
<sec>
<label>3.1.</label>
<title>Parallel Algorithm Transformation</title>
<p>The algorithm of the selected RS-reconstructive operations, <italic>i.e</italic>., <bold>b̂</bold> = (<bold>FU</bold>) ⊙ (<bold>FU</bold>)<sup>+</sup>, can be represented by nested-loop (FOR-loop) programs. First, from the reconstructive RS estimator of <xref ref-type="disp-formula" rid="FD1">Equation (1)</xref>, let us define the product of the <italic>n</italic> × <italic>m</italic> matrix <bold>F</bold> and the vector <bold>u</bold> of dimension <italic>m</italic> as follows:
<disp-formula id="FD5">
<label>(5)</label>
<mml:math display="block">
<mml:mrow>
<mml:mi mathvariant="bold">y</mml:mi>
<mml:mo>=</mml:mo>
<mml:mi mathvariant="bold">Fu</mml:mi></mml:mrow></mml:math></disp-formula>where <bold>y</bold> is an <italic>n-</italic>dimensional (<italic>n</italic>-<italic>D</italic>) output vector. The <italic>j-th</italic> element of <bold>y</bold> is computed as:
<disp-formula id="FD6">
<label>(6)</label>
<mml:math display="block">
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi>y</mml:mi></mml:mrow>
<mml:mi>j</mml:mi></mml:msub>
<mml:mo>=</mml:mo>
<mml:munderover>
<mml:mo>∑</mml:mo>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mo>=</mml:mo>
<mml:mn>1</mml:mn></mml:mrow>
<mml:mi>m</mml:mi></mml:munderover>
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi>F</mml:mi></mml:mrow>
<mml:mi mathvariant="italic">ji</mml:mi></mml:msub>
<mml:msub>
<mml:mrow>
<mml:mi>u</mml:mi></mml:mrow>
<mml:mi>i</mml:mi></mml:msub></mml:mrow>
<mml:mo>,</mml:mo>
<mml:mo> </mml:mo>
<mml:mo> </mml:mo>
<mml:mo> </mml:mo>
<mml:mo> </mml:mo>
<mml:mi>j</mml:mi>
<mml:mo>=</mml:mo>
<mml:mn>1</mml:mn>
<mml:mo>,</mml:mo>
<mml:mo>...</mml:mo>
<mml:mo>,</mml:mo>
<mml:mi>n</mml:mi></mml:mrow></mml:math></disp-formula>where <italic>F<sub>ji</sub></italic> denotes the element in row <italic>j</italic> and column <italic>i</italic> of <bold>F</bold>.</p>
<p>Next, the localization method converts the algorithm into a representation with local and regular dependencies [<xref ref-type="bibr" rid="b27-sensors-12-02539">27</xref>,<xref ref-type="bibr" rid="b28-sensors-12-02539">28</xref>]. Locality is achieved via affine scheduling transformations, which yield the following algorithm:
<disp-formula id="FD7">
<label>(7)</label>
<mml:math display="block">
<mml:mtable columnalign="left">
<mml:mtr>
<mml:mtd>
<mml:mi mathvariant="italic">input operations</mml:mi></mml:mtd></mml:mtr>
<mml:mtr>
<mml:mtd>
<mml:mtable columnalign="left">
<mml:mtr columnalign="left">
<mml:mtd columnalign="left">
<mml:mrow>
<mml:mi>F</mml:mi>
<mml:mo stretchy="false">[</mml:mo>
<mml:mi>i</mml:mi>
<mml:mo>,</mml:mo>
<mml:mo> </mml:mo>
<mml:mi>j</mml:mi>
<mml:mo stretchy="false">]</mml:mo>
<mml:mo>←</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi>F</mml:mi></mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi>j</mml:mi>
<mml:mo>−</mml:mo>
<mml:mn>1</mml:mn></mml:mrow></mml:msub></mml:mrow></mml:mtd>
<mml:mtd columnalign="left">
<mml:mrow>
<mml:mo>∀</mml:mo>
<mml:mo stretchy="false">(</mml:mo>
<mml:mi>i</mml:mi>
<mml:mo>,</mml:mo>
<mml:mo> </mml:mo>
<mml:mi>j</mml:mi>
<mml:mo stretchy="false">)</mml:mo>
<mml:mo> </mml:mo>
<mml:mo> </mml:mo>
<mml:mo stretchy="false">|</mml:mo>
<mml:mo> </mml:mo>
<mml:mo> </mml:mo>
<mml:mn>1</mml:mn>
<mml:mo>≤</mml:mo>
<mml:mi>i</mml:mi>
<mml:mo>≤</mml:mo>
<mml:mi>n</mml:mi>
<mml:mo>;</mml:mo>
<mml:mo> </mml:mo>
<mml:mo> </mml:mo>
<mml:mn>1</mml:mn>
<mml:mo>≤</mml:mo>
<mml:mi>j</mml:mi>
<mml:mo>≤</mml:mo>
<mml:mi>m</mml:mi>
<mml:mo>.</mml:mo></mml:mrow></mml:mtd></mml:mtr>
<mml:mtr columnalign="left">
<mml:mtd columnalign="left">
<mml:mrow>
<mml:mi>u</mml:mi>
<mml:mo stretchy="false">[</mml:mo>
<mml:mn>0</mml:mn>
<mml:mo>,</mml:mo>
<mml:mo> </mml:mo>
<mml:mi>j</mml:mi>
<mml:mo stretchy="false">]</mml:mo>
<mml:mo>←</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi>u</mml:mi></mml:mrow>
<mml:mi>j</mml:mi></mml:msub></mml:mrow></mml:mtd>
<mml:mtd columnalign="left">
<mml:mrow>
<mml:mo>∀</mml:mo>
<mml:mo> </mml:mo>
<mml:mo> </mml:mo>
<mml:mi>j</mml:mi>
<mml:mo> </mml:mo>
<mml:mo> </mml:mo>
<mml:mo stretchy="false">|</mml:mo>
<mml:mo> </mml:mo>
<mml:mo> </mml:mo>
<mml:mn>1</mml:mn>
<mml:mo>≤</mml:mo>
<mml:mi>j</mml:mi>
<mml:mo>≤</mml:mo>
<mml:mi>m</mml:mi>
<mml:mo>.</mml:mo></mml:mrow></mml:mtd></mml:mtr>
<mml:mtr columnalign="left">
<mml:mtd columnalign="left">
<mml:mrow>
<mml:mi>y</mml:mi>
<mml:mo stretchy="false">[</mml:mo>
<mml:mi>i</mml:mi>
<mml:mo>,</mml:mo>
<mml:mn>0</mml:mn>
<mml:mo stretchy="false">]</mml:mo>
<mml:mo>←</mml:mo>
<mml:mn>0</mml:mn></mml:mrow></mml:mtd>
<mml:mtd columnalign="left">
<mml:mrow>
<mml:mo>∀</mml:mo>
<mml:mo> </mml:mo>
<mml:mo> </mml:mo>
<mml:mi>i</mml:mi>
<mml:mo> </mml:mo>
<mml:mo> </mml:mo>
<mml:mo stretchy="false">|</mml:mo>
<mml:mo> </mml:mo>
<mml:mo> </mml:mo>
<mml:mn>1</mml:mn>
<mml:mo>≤</mml:mo>
<mml:mi>i</mml:mi>
<mml:mo>≤</mml:mo>
<mml:mi>n</mml:mi>
<mml:mo>.</mml:mo></mml:mrow></mml:mtd></mml:mtr></mml:mtable></mml:mtd></mml:mtr>
<mml:mtr>
<mml:mtd>
<mml:mi> </mml:mi>
<mml:mi mathvariant="italic">computations</mml:mi></mml:mtd></mml:mtr>
<mml:mtr>
<mml:mtd>
<mml:mi> </mml:mi>
<mml:mi mathvariant="italic">for</mml:mi>
<mml:mo stretchy="false">(</mml:mo>
<mml:mi>i</mml:mi>
<mml:mo>=</mml:mo>
<mml:mn>0</mml:mn>
<mml:mo>;</mml:mo>
<mml:mi>i</mml:mi>
<mml:mo>&lt;</mml:mo>
<mml:mi>n</mml:mi>
<mml:mo>;</mml:mo>
<mml:mi>i</mml:mi>
<mml:mo>+</mml:mo>
<mml:mo>+</mml:mo>
<mml:mo stretchy="false">)</mml:mo>
<mml:mo stretchy="false">{</mml:mo></mml:mtd></mml:mtr>
<mml:mtr>
<mml:mtd>
<mml:mi> </mml:mi>
<mml:mi> </mml:mi>
<mml:mi>y</mml:mi>
<mml:mo stretchy="false">[</mml:mo>
<mml:mi>i</mml:mi>
<mml:mo>,</mml:mo>
<mml:mn>0</mml:mn>
<mml:mo stretchy="false">]</mml:mo>
<mml:mo>=</mml:mo>
<mml:mn>0</mml:mn>
<mml:mo>;</mml:mo></mml:mtd></mml:mtr>
<mml:mtr>
<mml:mtd>
<mml:mi> </mml:mi>
<mml:mi> </mml:mi>
<mml:mi mathvariant="italic">for</mml:mi>
<mml:mo stretchy="false">(</mml:mo>
<mml:mi>j</mml:mi>
<mml:mo>=</mml:mo>
<mml:mn>0</mml:mn>
<mml:mo>;</mml:mo>
<mml:mi>j</mml:mi>
<mml:mo>&lt;</mml:mo>
<mml:mi>m</mml:mi>
<mml:mo>;</mml:mo>
<mml:mi>j</mml:mi>
<mml:mo>+</mml:mo>
<mml:mo>+</mml:mo>
<mml:mo stretchy="false">)</mml:mo>
<mml:mo stretchy="false">{</mml:mo></mml:mtd></mml:mtr>
<mml:mtr>
<mml:mtd>
<mml:mi> </mml:mi>
<mml:mi> </mml:mi>
<mml:mi> </mml:mi>
<mml:mi>u</mml:mi>
<mml:mo stretchy="false">[</mml:mo>
<mml:mi>i</mml:mi>
<mml:mo>,</mml:mo>
<mml:mo> </mml:mo>
<mml:mi>j</mml:mi>
<mml:mo stretchy="false">]</mml:mo>
<mml:mo>=</mml:mo>
<mml:mi>u</mml:mi>
<mml:mo stretchy="false">[</mml:mo>
<mml:mi>i</mml:mi>
<mml:mo>−</mml:mo>
<mml:mn>1</mml:mn>
<mml:mo>,</mml:mo>
<mml:mo> </mml:mo>
<mml:mi>j</mml:mi>
<mml:mo stretchy="false">]</mml:mo>
<mml:mo>;</mml:mo></mml:mtd></mml:mtr>
<mml:mtr>
<mml:mtd>
<mml:mi> </mml:mi>
<mml:mi> </mml:mi>
<mml:mi> </mml:mi>
<mml:mi>y</mml:mi>
<mml:mo stretchy="false">[</mml:mo>
<mml:mi>i</mml:mi>
<mml:mo>,</mml:mo>
<mml:mo> </mml:mo>
<mml:mi>j</mml:mi>
<mml:mo stretchy="false">]</mml:mo>
<mml:mo>=</mml:mo>
<mml:mi>y</mml:mi>
<mml:mo stretchy="false">[</mml:mo>
<mml:mi>i</mml:mi>
<mml:mo>,</mml:mo>
<mml:mo> </mml:mo>
<mml:mi>j</mml:mi>
<mml:mo>−</mml:mo>
<mml:mn>1</mml:mn>
<mml:mo stretchy="false">]</mml:mo>
<mml:mo>+</mml:mo>
<mml:mi>F</mml:mi>
<mml:mo stretchy="false">[</mml:mo>
<mml:mi>i</mml:mi>
<mml:mo>,</mml:mo>
<mml:mo> </mml:mo>
<mml:mi>j</mml:mi>
<mml:mo stretchy="false">]</mml:mo>
<mml:mo>⋅</mml:mo>
<mml:mi>u</mml:mi>
<mml:mo stretchy="false">[</mml:mo>
<mml:mi>i</mml:mi>
<mml:mo>,</mml:mo>
<mml:mo> </mml:mo>
<mml:mi>j</mml:mi>
<mml:mo stretchy="false">]</mml:mo>
<mml:mo>;</mml:mo>
<mml:mo> </mml:mo>
<mml:mo> </mml:mo>
<mml:mo> </mml:mo>
<mml:mo stretchy="false">}</mml:mo>
<mml:mo> </mml:mo>
<mml:mo stretchy="false">}</mml:mo></mml:mtd></mml:mtr>
<mml:mtr>
<mml:mtd>
<mml:mi mathvariant="italic">output operations</mml:mi></mml:mtd></mml:mtr>
<mml:mtr>
<mml:mtd>
<mml:mtable columnalign="left">
<mml:mtr columnalign="left">
<mml:mtd columnalign="left">
<mml:mrow>
<mml:mi>y</mml:mi>
<mml:mo stretchy="false">[</mml:mo>
<mml:mi>i</mml:mi>
<mml:mo stretchy="false">]</mml:mo>
<mml:mo>←</mml:mo>
<mml:mi>y</mml:mi>
<mml:mo stretchy="false">[</mml:mo>
<mml:mi>i</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi>m</mml:mi>
<mml:mo stretchy="false">]</mml:mo></mml:mrow></mml:mtd>
<mml:mtd columnalign="left">
<mml:mrow>
<mml:mo>∀</mml:mo>
<mml:mo stretchy="false">(</mml:mo>
<mml:mi>i</mml:mi>
<mml:mo>,</mml:mo>
<mml:mo> </mml:mo>
<mml:mi>j</mml:mi>
<mml:mo stretchy="false">)</mml:mo>
<mml:mo> </mml:mo>
<mml:mo> </mml:mo>
<mml:mo stretchy="false">|</mml:mo>
<mml:mo> </mml:mo>
<mml:mo> </mml:mo>
<mml:mn>1</mml:mn>
<mml:mo>≤</mml:mo>
<mml:mi>i</mml:mi>
<mml:mo>≤</mml:mo>
<mml:mi>n</mml:mi>
<mml:mo>;</mml:mo>
<mml:mo> </mml:mo>
<mml:mo> </mml:mo>
<mml:mo> </mml:mo>
<mml:mn>1</mml:mn>
<mml:mo>≤</mml:mo>
<mml:mi>j</mml:mi>
<mml:mo>≤</mml:mo>
<mml:mi>m</mml:mi></mml:mrow></mml:mtd></mml:mtr></mml:mtable>
<mml:mo>.</mml:mo></mml:mtd></mml:mtr>
<mml:mtr>
<mml:mtd>
<mml:mi mathvariant="italic">where the index space is</mml:mi></mml:mtd></mml:mtr>
<mml:mtr>
<mml:mtd>
<mml:mi>I</mml:mi>
<mml:mo>=</mml:mo>
<mml:mo stretchy="false">{</mml:mo>
<mml:msup>
<mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:mi>i</mml:mi>
<mml:mo>,</mml:mo>
<mml:mo> </mml:mo>
<mml:mi>j</mml:mi>
<mml:mo stretchy="false">)</mml:mo></mml:mrow>
<mml:mi mathvariant="normal">T</mml:mi></mml:msup>
<mml:mo>∈</mml:mo>
<mml:msup>
<mml:mrow>
<mml:mi>ℤ</mml:mi></mml:mrow>
<mml:mn>2</mml:mn></mml:msup>
<mml:mo stretchy="false">|</mml:mo>
<mml:mn>1</mml:mn>
<mml:mo>≤</mml:mo>
<mml:mi>i</mml:mi>
<mml:mo>≤</mml:mo>
<mml:mi>n</mml:mi>
<mml:mo>;</mml:mo>
<mml:mo> </mml:mo>
<mml:mo> </mml:mo>
<mml:mn>1</mml:mn>
<mml:mo>≤</mml:mo>
<mml:mi>j</mml:mi>
<mml:mo>≤</mml:mo>
<mml:mi>m</mml:mi>
<mml:mo stretchy="false">}</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
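<p>The locally recursive form of <xref ref-type="disp-formula" rid="FD7">Equation (7)</xref> can be sketched in Python as follows and checked against the direct product of <xref ref-type="disp-formula" rid="FD5">Equation (5)</xref>. The sketch assumes one consistent indexing: the index space runs over 1 ≤ <italic>i</italic> ≤ <italic>n</italic>, 1 ≤ <italic>j</italic> ≤ <italic>m</italic>, the index 0 holds the boundary (input) values, and <italic>F</italic>[<italic>i</italic>, <italic>j</italic>] is taken directly as the (<italic>i</italic>, <italic>j</italic>) entry of <bold>F</bold>. The function names are hypothetical.</p>
<preformat>
```python
from typing import List

Matrix = List[List[float]]


def localized_matvec(F: Matrix, u: List[float]) -> List[float]:
    """Compute y = F u with propagated variables, as in the localized form."""
    n, m = len(F), len(u)
    # Row 0 and column 0 hold the input values; 1..n and 1..m span the index space.
    u_loc = [[0.0] * (m + 1) for _ in range(n + 1)]
    y_loc = [[0.0] * (m + 1) for _ in range(n + 1)]
    for j in range(1, m + 1):
        u_loc[0][j] = u[j - 1]                     # input: u[0, j] receives u_j
    for i in range(1, n + 1):
        y_loc[i][0] = 0.0                          # input: y[i, 0] receives 0
        for j in range(1, m + 1):
            u_loc[i][j] = u_loc[i - 1][j]          # pass u_j on to the next row
            y_loc[i][j] = y_loc[i][j - 1] + F[i - 1][j - 1] * u_loc[i][j]
    return [y_loc[i][m] for i in range(1, n + 1)]  # output: y[i] receives y[i, m]


def direct_matvec(F: Matrix, u: List[float]) -> List[float]:
    """Reference product y = F u written as plain row-by-column sums."""
    return [sum(F[r][c] * u[c] for c in range(len(u))) for r in range(len(F))]


F_example = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]     # n = 2, m = 3
u_example = [1.0, 0.0, -1.0]
print(localized_matvec(F_example, u_example) == direct_matvec(F_example, u_example))
```
</preformat>
<p>Each variable in the inner statement depends only on a neighbouring index point, (<italic>i</italic> − 1, <italic>j</italic>) or (<italic>i</italic>, <italic>j</italic> − 1), which is the local and regular dependence structure that a systolic mapping requires.</p>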
<p>Once the algorithm has been transformed into its localized representation (<italic>i.e</italic>., the locally recursive form), one is ready to proceed with the tiling procedure in order to map realistic large-scale RS structures onto the fixed-size SA architecture.</p></sec>
<sec>
<label>3.2.</label>
<title>Tiling Transformation Technique</title>
<p>The tiling technique is a well-known loop transformation used to automatically create sub-block algorithms [<xref ref-type="bibr" rid="b26-sensors-12-02539">26</xref>,<xref ref-type="bibr" rid="b29-sensors-12-02539">29</xref>,<xref ref-type="bibr" rid="b30-sensors-12-02539">30</xref>]. The advantage of this method is that, while computing within a block, there is a high degree of data locality, which allows better performance. The <italic>tiling</italic> procedure consists of dividing the large-scale index space defined by the loop structures of the real-scale complex RS operations into regular tiles (or blocks) of a given size and shape, and then traversing the tiles to cover the whole index space [<xref ref-type="bibr" rid="b30-sensors-12-02539">30</xref>]. The conventional <italic>tiling</italic> procedure combines two well-known transformations: loop permutation and strip-mining. The loop permutation establishes the order in which the iterations inside the tiles are traversed, and the strip-mining transformation partitions one dimension of the index space into strips. Strip-mining decomposes a single loop into two nested loops: the outer loop steps between strips of consecutive indexes, and the inner loop traverses the indexes within a strip. Both transformations can be obtained using the theory of unimodular transformations, and the exact loop bounds are computed with the Fourier-Motzkin elimination algorithm [<xref ref-type="bibr" rid="b30-sensors-12-02539">30</xref>].</p>
<p>Now, considering the locally recursive representation presented in <xref ref-type="disp-formula" rid="FD7">Equation (7)</xref>, the strip-mining transformation is applied to the outermost loop in order to partition the <italic>i</italic>-index of the algorithm in one dimension. The resulting index partition is represented as follows:
<disp-formula id="FD8">
<label>(8)</label>
<mml:math display="block">
<mml:mrow>
<mml:mtable columnalign="left">
<mml:mtr columnalign="left">
<mml:mtd columnalign="left">
<mml:mrow>
<mml:mi mathvariant="italic">for</mml:mi>
<mml:mo stretchy="false">(</mml:mo>
<mml:mi>i</mml:mi>
<mml:mo>=</mml:mo>
<mml:mn>0</mml:mn>
<mml:mo>;</mml:mo>
<mml:mi>i</mml:mi>
<mml:mo>&lt;</mml:mo>
<mml:mi>n</mml:mi>
<mml:mo>;</mml:mo>
<mml:mi>i</mml:mi>
<mml:mo>+</mml:mo>
<mml:mo>+</mml:mo>
<mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mtd>
<mml:mtd columnalign="left">
<mml:mrow/></mml:mtd>
<mml:mtd columnalign="left">
<mml:mrow>
<mml:mi mathvariant="italic">for</mml:mi>
<mml:mo stretchy="false">(</mml:mo>
<mml:mi mathvariant="italic">tile_i</mml:mi>
<mml:mo>=</mml:mo>
<mml:mn>0</mml:mn>
<mml:mo>;</mml:mo>
<mml:mi mathvariant="italic">tile_i</mml:mi>
<mml:mo>&lt;</mml:mo>
<mml:mi>n</mml:mi>
<mml:mo>;</mml:mo>
<mml:mi mathvariant="italic">tile_i</mml:mi>
<mml:mo>=</mml:mo>
<mml:mi mathvariant="italic">tile_i</mml:mi>
<mml:mo>+</mml:mo>
<mml:mi mathvariant="italic">StSizei</mml:mi>
<mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mtd></mml:mtr>
<mml:mtr columnalign="left">
<mml:mtd columnalign="left">
<mml:mrow>
<mml:mi mathvariant="italic">for</mml:mi>
<mml:mo stretchy="false">(</mml:mo>
<mml:mi>j</mml:mi>
<mml:mo>=</mml:mo>
<mml:mn>0</mml:mn>
<mml:mo>;</mml:mo>
<mml:mi>j</mml:mi>
<mml:mo>&lt;</mml:mo>
<mml:mi>m</mml:mi>
<mml:mo>;</mml:mo>
<mml:mi>j</mml:mi>
<mml:mo>+</mml:mo>
<mml:mo>+</mml:mo>
<mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mtd>
<mml:mtd columnalign="left">
<mml:mrow>
<mml:mover>
<mml:mo stretchy="true">→</mml:mo>
<mml:mrow>
<mml:mtext>strip</mml:mtext>
<mml:mo>−</mml:mo>
<mml:mtext>mining</mml:mtext></mml:mrow></mml:mover></mml:mrow></mml:mtd>
<mml:mtd columnalign="left">
<mml:mrow>
<mml:mi mathvariant="italic">for</mml:mi>
<mml:mo stretchy="false">(</mml:mo>
<mml:mi>i</mml:mi>
<mml:mo>=</mml:mo>
<mml:mi mathvariant="italic">tile_i</mml:mi>
<mml:mo>;</mml:mo>
<mml:mi>i</mml:mi>
<mml:mo>≤</mml:mo>
<mml:mtext>min</mml:mtext>
<mml:mo stretchy="false">[</mml:mo>
<mml:mi mathvariant="italic">tile_i</mml:mi>
<mml:mo>+</mml:mo>
<mml:mi mathvariant="italic">StSizei</mml:mi>
<mml:mo>−</mml:mo>
<mml:mn>1</mml:mn>
<mml:mo>,</mml:mo>
<mml:mi>n</mml:mi>
<mml:mo>−</mml:mo>
<mml:mn>1</mml:mn>
<mml:mo> </mml:mo>
<mml:mo stretchy="false">]</mml:mo>
<mml:mo>;</mml:mo>
<mml:mi>i</mml:mi>
<mml:mo>+</mml:mo>
<mml:mo>+</mml:mo>
<mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mtd></mml:mtr>
<mml:mtr columnalign="left">
<mml:mtd columnalign="left">
<mml:mrow>
<mml:mi mathvariant="italic">Loop body</mml:mi></mml:mrow></mml:mtd>
<mml:mtd columnalign="left">
<mml:mrow/></mml:mtd>
<mml:mtd columnalign="left">
<mml:mrow>
<mml:mi mathvariant="italic">for</mml:mi>
<mml:mo stretchy="false">(</mml:mo>
<mml:mi>j</mml:mi>
<mml:mo>=</mml:mo>
<mml:mn>0</mml:mn>
<mml:mo>;</mml:mo>
<mml:mi>j</mml:mi>
<mml:mo>&lt;</mml:mo>
<mml:mi>m</mml:mi>
<mml:mo>;</mml:mo>
<mml:mi>j</mml:mi>
<mml:mo>+</mml:mo>
<mml:mo>+</mml:mo>
<mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mtd></mml:mtr>
<mml:mtr columnalign="left">
<mml:mtd columnalign="left">
<mml:mrow/></mml:mtd>
<mml:mtd columnalign="left">
<mml:mrow/></mml:mtd>
<mml:mtd columnalign="left">
<mml:mrow>
<mml:mi mathvariant="italic">Loop body</mml:mi>
<mml:mo> </mml:mo>
<mml:mo> </mml:mo>
<mml:mo stretchy="false">[</mml:mo>
<mml:mi>i</mml:mi>
<mml:mo>,</mml:mo>
<mml:mo> </mml:mo>
<mml:mi>j</mml:mi>
<mml:mo stretchy="false">]</mml:mo></mml:mrow></mml:mtd></mml:mtr></mml:mtable></mml:mrow></mml:math></disp-formula>where <italic>tile_i</italic> is the index of the <italic>i</italic>-index tile loop, <italic>i</italic> and <italic>j</italic> are the indices of the inner element loops, and <italic>StSizei</italic> is the strip size along the <italic>i</italic>-index.</p>
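<p>The strip-mining step of <xref ref-type="disp-formula" rid="FD8">Equation (8)</xref> can be illustrated with the short Python sketch below, which records the order in which the index points (<italic>i</italic>, <italic>j</italic>) are visited before and after the transformation. The function names and the example sizes are hypothetical; the loop bounds follow the right-hand side of Equation (8).</p>
<preformat>
```python
from typing import List, Tuple

Point = Tuple[int, int]


def original_order(n: int, m: int) -> List[Point]:
    """Visit order of the untransformed nest: for i, then for j."""
    return [(i, j) for i in range(n) for j in range(m)]


def tile_starts(n: int, st_size_i: int) -> List[int]:
    """Values taken by tile_i: 0, StSizei, 2*StSizei, ... below n."""
    if 1 > st_size_i:
        raise ValueError("the strip size must be at least 1")
    return list(range(0, n, st_size_i))


def strip_mined_order(n: int, m: int, st_size_i: int) -> List[Point]:
    """Visit order after strip-mining the i-loop into tile_i and i."""
    order = []
    for tile_i in tile_starts(n, st_size_i):
        last_i = min(tile_i + st_size_i - 1, n - 1)   # clip the last, partial strip
        for i in range(tile_i, last_i + 1):
            for j in range(m):
                order.append((i, j))
    return order


# n = 5 rows split into strips of 2 rows: the last strip holds a single row.
print(tile_starts(5, 2))
print(strip_mined_order(5, 3, 2) == original_order(5, 3))
```
</preformat>
<p>Strip-mining on its own leaves the visit order unchanged; it only exposes the tile loop, so that the subsequent loop permutation can reorder the iterations within each strip.</p>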
<p>The second step of the tiling procedure consists in implementing the loop permutation transformation based on the Polytope model [<xref ref-type="bibr" rid="b30-sensors-12-02539">30</xref>]. For the loop permutation, the following unimodular transformation 
<inline-formula>
<mml:math>
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="bold">T</mml:mi></mml:mrow>
<mml:mi mathvariant="bold">P</mml:mi></mml:msub>
<mml:mo>=</mml:mo>
<mml:mrow>
<mml:mo>[</mml:mo>
<mml:mrow>
<mml:mtable>
<mml:mtr>
<mml:mtd>
<mml:mn>0</mml:mn></mml:mtd>
<mml:mtd>
<mml:mn>1</mml:mn></mml:mtd></mml:mtr>
<mml:mtr>
<mml:mtd>
<mml:mn>1</mml:mn></mml:mtd>
<mml:mtd>
<mml:mn>0</mml:mn></mml:mtd></mml:mtr></mml:mtable></mml:mrow>
<mml:mo>]</mml:mo></mml:mrow></mml:mrow></mml:math></inline-formula> is applied in order to permute the index space of the locally recursive algorithm <bold>I</bold> = [<italic>i j</italic>]<sup>T</sup> into the required new index space <bold>I<sub>P</sub></bold> = [<italic>i’ j’</italic>]<sup>T</sup> = [<italic>j i</italic>]<sup>T</sup>. In this step, the Polytope model is defined as a set of inequalities of the form <bold>ΓI<sub>P</sub></bold> ≤ <bold>H</bold>, where <bold>I<sub>P</sub></bold> = [<italic>i’ j’</italic>]<sup>T</sup> represents the index space after the <italic>i</italic>-strip-mining procedure, and the matrix <bold>Γ</bold> and the vector <bold>H</bold> represent the bounds of each FOR-loop of the algorithm presented in <xref ref-type="disp-formula" rid="FD8">Equation (8)</xref>.</p>
<p>The source Polytope is described in convex form by a set of half-spaces, whose intersection corresponds to the Polytope. The target representation is obtained as follows:
<disp-formula id="FD9">
<label>(9)</label>
<mml:math display="block">
<mml:mrow>
<mml:mrow>
<mml:mrow>
<mml:mtable>
<mml:mtr>
<mml:mtd>
<mml:mrow>
<mml:mi mathvariant="bold">Γ</mml:mi>
<mml:mo> </mml:mo>
<mml:mi mathvariant="bold">I</mml:mi>
<mml:mo>≤</mml:mo>
<mml:mi mathvariant="bold">H</mml:mi></mml:mrow></mml:mtd></mml:mtr>
<mml:mtr>
<mml:mtd>
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="bold">T</mml:mi></mml:mrow>
<mml:mi mathvariant="bold">P</mml:mi></mml:msub>
<mml:mo> </mml:mo>
<mml:mi mathvariant="bold">I</mml:mi>
<mml:mo>=</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="bold">I</mml:mi></mml:mrow>
<mml:mi mathvariant="bold">P</mml:mi></mml:msub></mml:mrow></mml:mtd></mml:mtr></mml:mtable></mml:mrow>
<mml:mo>}</mml:mo></mml:mrow>
<mml:mo> </mml:mo>
<mml:mi> </mml:mi>
<mml:mo>⇒</mml:mo>
<mml:mi> </mml:mi>
<mml:mtable>
<mml:mtr>
<mml:mtd>
<mml:mrow>
<mml:mi mathvariant="bold">Γ</mml:mi>
<mml:mo> </mml:mo>
<mml:msubsup>
<mml:mrow>
<mml:mi mathvariant="bold">T</mml:mi></mml:mrow>
<mml:mi mathvariant="bold">P</mml:mi>
<mml:mrow>
<mml:mo>−</mml:mo>
<mml:mn>1</mml:mn></mml:mrow></mml:msubsup>
<mml:mo> </mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="bold">I</mml:mi></mml:mrow>
<mml:mi mathvariant="bold">P</mml:mi></mml:msub>
<mml:mo>≤</mml:mo>
<mml:mi mathvariant="bold">H</mml:mi></mml:mrow></mml:mtd></mml:mtr>
<mml:mtr>
<mml:mtd>
<mml:mrow>
<mml:mi mathvariant="bold">Γ</mml:mi>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="bold">I</mml:mi></mml:mrow>
<mml:mi mathvariant="bold">P</mml:mi></mml:msub>
<mml:mo>≤</mml:mo>
<mml:mi mathvariant="bold">H</mml:mi></mml:mrow></mml:mtd></mml:mtr></mml:mtable>
<mml:mi> </mml:mi>
<mml:mo>⇒</mml:mo>
<mml:mi> </mml:mi>
<mml:munder>
<mml:munder>
<mml:mrow>
<mml:mrow>
<mml:mo>[</mml:mo>
<mml:mrow>
<mml:mtable>
<mml:mtr>
<mml:mtd>
<mml:mrow>
<mml:mo>−</mml:mo>
<mml:mn>1</mml:mn></mml:mrow></mml:mtd>
<mml:mtd>
<mml:mn>0</mml:mn></mml:mtd></mml:mtr>
<mml:mtr>
<mml:mtd>
<mml:mn>1</mml:mn></mml:mtd>
<mml:mtd>
<mml:mn>0</mml:mn></mml:mtd></mml:mtr>
<mml:mtr>
<mml:mtd>
<mml:mn>1</mml:mn></mml:mtd>
<mml:mtd>
<mml:mn>0</mml:mn></mml:mtd></mml:mtr>
<mml:mtr>
<mml:mtd>
<mml:mn>0</mml:mn></mml:mtd>
<mml:mtd>
<mml:mrow>
<mml:mo>−</mml:mo>
<mml:mn>1</mml:mn></mml:mrow></mml:mtd></mml:mtr>
<mml:mtr>
<mml:mtd>
<mml:mn>0</mml:mn></mml:mtd>
<mml:mtd>
<mml:mn>1</mml:mn></mml:mtd></mml:mtr></mml:mtable></mml:mrow>
<mml:mo>]</mml:mo></mml:mrow></mml:mrow>
<mml:mo stretchy="true">︸</mml:mo></mml:munder>
<mml:mi mathvariant="bold">Γ</mml:mi></mml:munder>
<mml:mo> </mml:mo>
<mml:munder>
<mml:munder>
<mml:mrow>
<mml:mrow>
<mml:mo>[</mml:mo>
<mml:mrow>
<mml:mtable>
<mml:mtr>
<mml:mtd>
<mml:mn>0</mml:mn></mml:mtd>
<mml:mtd>
<mml:mn>1</mml:mn></mml:mtd></mml:mtr>
<mml:mtr>
<mml:mtd>
<mml:mn>1</mml:mn></mml:mtd>
<mml:mtd>
<mml:mn>0</mml:mn></mml:mtd></mml:mtr></mml:mtable></mml:mrow>
<mml:mo>]</mml:mo></mml:mrow></mml:mrow>
<mml:mo stretchy="true">︸</mml:mo></mml:munder>
<mml:mrow>
<mml:msubsup>
<mml:mrow>
<mml:mi mathvariant="bold">T</mml:mi></mml:mrow>
<mml:mi mathvariant="bold">P</mml:mi>
<mml:mrow>
<mml:mo>−</mml:mo>
<mml:mn>1</mml:mn></mml:mrow></mml:msubsup></mml:mrow></mml:munder>
<mml:mo> </mml:mo>
<mml:munder>
<mml:munder>
<mml:mrow>
<mml:mrow>
<mml:mo>[</mml:mo>
<mml:mrow>
<mml:mtable>
<mml:mtr>
<mml:mtd>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mo>′</mml:mo></mml:mrow></mml:mtd></mml:mtr>
<mml:mtr>
<mml:mtd>
<mml:mrow>
<mml:mi>j</mml:mi>
<mml:mo>′</mml:mo></mml:mrow></mml:mtd></mml:mtr></mml:mtable></mml:mrow>
<mml:mo>]</mml:mo></mml:mrow></mml:mrow>
<mml:mo stretchy="true">︸</mml:mo></mml:munder>
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="bold">I</mml:mi></mml:mrow>
<mml:mi mathvariant="bold">P</mml:mi></mml:msub></mml:mrow></mml:munder>
<mml:mo> </mml:mo>
<mml:mo>≤</mml:mo>
<mml:mo> </mml:mo>
<mml:munder>
<mml:munder>
<mml:mrow>
<mml:mrow>
<mml:mo>[</mml:mo>
<mml:mrow>
<mml:mtable>
<mml:mtr>
<mml:mtd>
<mml:mrow>
<mml:mo>−</mml:mo>
<mml:mi mathvariant="italic">tile_i</mml:mi></mml:mrow></mml:mtd></mml:mtr>
<mml:mtr>
<mml:mtd>
<mml:mrow>
<mml:mi mathvariant="italic">tile_i</mml:mi>
<mml:mo>+</mml:mo>
<mml:mi mathvariant="italic">StSize</mml:mi>
<mml:mo>−</mml:mo>
<mml:mn>1</mml:mn></mml:mrow></mml:mtd></mml:mtr>
<mml:mtr>
<mml:mtd>
<mml:mrow>
<mml:mi>n</mml:mi>
<mml:mo>−</mml:mo>
<mml:mn>1</mml:mn></mml:mrow></mml:mtd></mml:mtr>
<mml:mtr>
<mml:mtd>
<mml:mn>0</mml:mn></mml:mtd></mml:mtr>
<mml:mtr>
<mml:mtd>
<mml:mrow>
<mml:mi>m</mml:mi>
<mml:mo>−</mml:mo>
<mml:mn>1</mml:mn></mml:mrow></mml:mtd></mml:mtr></mml:mtable></mml:mrow>
<mml:mo>]</mml:mo></mml:mrow></mml:mrow>
<mml:mo stretchy="true">︸</mml:mo></mml:munder>
<mml:mi mathvariant="bold">H</mml:mi></mml:munder>
<mml:mo>.</mml:mo></mml:mrow></mml:math></disp-formula></p>
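<p>The change of variables in <xref ref-type="disp-formula" rid="FD9">Equation (9)</xref> can be checked numerically with the Python sketch below. It builds <bold>Γ</bold>, <bold>T<sub>P</sub></bold> and <bold>H</bold> as written in Equation (9), forms the product of <bold>Γ</bold> with the inverse of <bold>T<sub>P</sub></bold>, and verifies that the integer points of the target Polytope are exactly the permuted points (<italic>j</italic>, <italic>i</italic>) of one tile. The helper names and the example sizes are hypothetical, and the points are enumerated by brute force rather than with the Fourier-Motzkin elimination.</p>
<preformat>
```python
from itertools import product

GAMMA = [[-1, 0], [1, 0], [1, 0], [0, -1], [0, 1]]   # bounds of the (i, j) loops
T_P = [[0, 1], [1, 0]]                               # permutation; it is its own inverse


def matmul(a, b):
    """Plain integer matrix product of two nested lists."""
    return [[sum(a[r][k] * b[k][c] for k in range(len(b))) for c in range(len(b[0]))]
            for r in range(len(a))]


def bounds_vector(tile_i, st_size, n, m):
    """Vector H of Equation (9) for one tile."""
    return [-tile_i, tile_i + st_size - 1, n - 1, 0, m - 1]


def satisfies(gamma, point, h):
    """True when every row of gamma applied to point stays within h."""
    return all(h_r >= sum(g * p for g, p in zip(row, point))
               for row, h_r in zip(gamma, h))


def target_points(tile_i, st_size, n, m):
    """Integer points (i', j') of the target Polytope, found by enumeration."""
    gamma_target = matmul(GAMMA, T_P)                # Gamma times the inverse of T_P
    h = bounds_vector(tile_i, st_size, n, m)
    box = product(range(max(n, m)), repeat=2)
    return sorted(p for p in box if satisfies(gamma_target, p, h))


def permuted_source_points(tile_i, st_size, n, m):
    """Points (j, i) obtained by permuting the (i, j) iterations of one tile."""
    last_i = min(tile_i + st_size - 1, n - 1)
    return sorted((j, i) for i in range(tile_i, last_i + 1) for j in range(m))


print(matmul(GAMMA, T_P))
print(target_points(2, 2, 5, 3) == permuted_source_points(2, 2, 5, 3))
```
</preformat>
<p>Because <bold>T<sub>P</sub></bold> is a permutation matrix, multiplying <bold>Γ</bold> by its inverse simply swaps the two columns of <bold>Γ</bold>, so that each bound on <italic>i</italic> becomes a bound on <italic>j’</italic> and each bound on <italic>j</italic> becomes a bound on <italic>i’</italic>.</p>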
<p>The new loop bounds are derived with the Fourier-Motzkin elimination algorithm. The reconstructive RS algorithm that results from the loop permutation transformation, with the proper substitutions applied, is presented next:
<disp-formula id="FD10">
<label>(10)</label>
<mml:math display="block">
<mml:mtable columnalign="left">
<mml:mtr>
<mml:mtd>
<mml:mi mathvariant="italic">for</mml:mi>
<mml:mo stretchy="false">(</mml:mo>
<mml:mi mathvariant="italic">tile_i</mml:mi>
<mml:mo>=</mml:mo>
<mml:mn>0</mml:mn>
<mml:mo>;</mml:mo>
<mml:mi mathvariant="italic">tile_i</mml:mi>
<mml:mo>&lt;</mml:mo>
<mml:mi>n</mml:mi>
<mml:mo>;</mml:mo>
<mml:mi mathvariant="italic">tile_i</mml:mi>
<mml:mo>=</mml:mo>
<mml:mi mathvariant="italic">tile_i</mml:mi>
<mml:mo>+</mml:mo>
<mml:mi mathvariant="italic">StSizei</mml:mi>
<mml:mo stretchy="false">)</mml:mo>
<mml:mo stretchy="false">{</mml:mo></mml:mtd></mml:mtr>
<mml:mtr>
<mml:mtd>
<mml:mi mathvariant="italic">for</mml:mi>
<mml:mo stretchy="false">(</mml:mo>
<mml:mi>j</mml:mi>
<mml:mo>=</mml:mo>
<mml:mn>0</mml:mn>
<mml:mo>;</mml:mo>
<mml:mi>j</mml:mi>
<mml:mo>&lt;</mml:mo>
<mml:mi>m</mml:mi>
<mml:mo>;</mml:mo>
<mml:mi>j</mml:mi>
<mml:mo>+</mml:mo>
<mml:mo>+</mml:mo>
<mml:mo stretchy="false">)</mml:mo>
<mml:mo stretchy="false">{</mml:mo></mml:mtd></mml:mtr>
<mml:mtr>
<mml:mtd>
<mml:mi mathvariant="italic">for</mml:mi>
<mml:mo stretchy="false">(</mml:mo>
<mml:mi>i</mml:mi>
<mml:mo>=</mml:mo>
<mml:mi mathvariant="italic">tile_i</mml:mi>
<mml:mo>;</mml:mo>
<mml:mi>i</mml:mi>
<mml:mo>≤</mml:mo>
<mml:mtext>min</mml:mtext>
<mml:mo stretchy="false">[</mml:mo>
<mml:mi mathvariant="italic">tile_i</mml:mi>
<mml:mo>+</mml:mo>
<mml:mi mathvariant="italic">StSizei</mml:mi>
<mml:mo>−</mml:mo>
<mml:mn>1</mml:mn>
<mml:mo>,</mml:mo>
<mml:mi>n</mml:mi>
<mml:mo>−</mml:mo>
<mml:mn>1</mml:mn>
<mml:mo> </mml:mo>
<mml:mo stretchy="false">]</mml:mo>
<mml:mo>;</mml:mo>
<mml:mi>i</mml:mi>
<mml:mo>+</mml:mo>
<mml:mo>+</mml:mo>
<mml:mo stretchy="false">)</mml:mo>
<mml:mo stretchy="false">{</mml:mo></mml:mtd></mml:mtr>
<mml:mtr>
<mml:mtd>
<mml:mi mathvariant="italic">Loop body</mml:mi>
<mml:mo> </mml:mo>
<mml:mo> </mml:mo>
<mml:mo stretchy="false">[</mml:mo>
<mml:mi>j</mml:mi>
<mml:mo>,</mml:mo>
<mml:mo> </mml:mo>
<mml:mi>i</mml:mi>
<mml:mo> </mml:mo>
<mml:mo stretchy="false">]</mml:mo></mml:mtd></mml:mtr>
<mml:mtr>
<mml:mtd>
<mml:mo stretchy="false">}</mml:mo>
<mml:mo stretchy="false">}</mml:mo>
<mml:mo stretchy="false">}</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
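<p>As an illustration only, the loop nest of Equation (10) can be sketched in Python. The function name <monospace>strip_mined_order</monospace> and the sample sizes are assumptions introduced here and are not part of the original design. The sketch records the order in which the index pairs (<italic>i</italic>, <italic>j</italic>) are visited, so that one can check that strip-mining over <italic>i</italic> followed by the loop permutation still covers the <italic>n</italic> × <italic>m</italic> index space exactly once.</p>
<preformat><![CDATA[
```python
def strip_mined_order(n, m, st_size_i):
    """Return the (i, j) pairs in the order visited by the loop nest of Eq. (10).

    n, m       -- extents of the i- and j-loops
    st_size_i  -- strip size along i (StSizei in the text)
    """
    visited = []
    for tile_i in range(0, n, st_size_i):             # strip loop over i
        for j in range(m):                             # j-loop moved outward by the permutation
            upper = min(tile_i + st_size_i - 1, n - 1) # clamp the last, possibly partial, strip
            for i in range(tile_i, upper + 1):         # i stays inside the current strip
                visited.append((i, j))
    return visited


if __name__ == "__main__":
    # Hypothetical sizes chosen so that the last strip is partial.
    order = strip_mined_order(n=5, m=3, st_size_i=2)
    print(order[:4])   # [(0, 0), (1, 0), (0, 1), (1, 1)]
    print(len(order))  # 15
```
]]></preformat>
<p>The <monospace>min</monospace> bound is what keeps the last strip inside the index space when <italic>n</italic> is not a multiple of the strip size.</p>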
<p>The third step of the tiling procedure is again a strip-mining transformation, this time applied over the <italic>j</italic>-index. The final step applies the loop permutation once more. This final transformation is required to order the inner loops of the final tiled algorithm, which is represented as follows:
<disp-formula id="FD11">
<label>(11)</label>
<mml:math display="block">
<mml:mtable columnalign="left">
<mml:mtr>
<mml:mtd>
<mml:mi mathvariant="italic">for</mml:mi>
<mml:mo stretchy="false">(</mml:mo>
<mml:mi mathvariant="italic">tile_i</mml:mi>
<mml:mo>=</mml:mo>
<mml:mn>0</mml:mn>
<mml:mo>;</mml:mo>
<mml:mi mathvariant="italic">tile_i</mml:mi>
<mml:mo>&lt;</mml:mo>
<mml:mi>n</mml:mi>
<mml:mo>;</mml:mo>
<mml:mi mathvariant="italic">tile_i</mml:mi>
<mml:mo>=</mml:mo>
<mml:mi mathvariant="italic">tile_i</mml:mi>
<mml:mo>+</mml:mo>
<mml:mi mathvariant="italic">StSizei</mml:mi>
<mml:mo stretchy="false">)</mml:mo>
<mml:mo stretchy="false">{</mml:mo></mml:mtd></mml:mtr>
<mml:mtr>
<mml:mtd>
<mml:mi mathvariant="italic">for</mml:mi>
<mml:mo stretchy="false">(</mml:mo>
<mml:mi mathvariant="italic">tile_j</mml:mi>
<mml:mo>=</mml:mo>
<mml:mn>0</mml:mn>
<mml:mo>;</mml:mo>
<mml:mi mathvariant="italic">tile_j</mml:mi>
<mml:mo>&lt;</mml:mo>
<mml:mi>m</mml:mi>
<mml:mo>;</mml:mo>
<mml:mi mathvariant="italic">tile_j</mml:mi>
<mml:mo>=</mml:mo>
<mml:mi mathvariant="italic">tile_j</mml:mi>
<mml:mo>+</mml:mo>
<mml:mi mathvariant="italic">StSizej</mml:mi>
<mml:mo stretchy="false">)</mml:mo>
<mml:mo stretchy="false">{</mml:mo></mml:mtd></mml:mtr>
<mml:mtr>
<mml:mtd>
<mml:mi mathvariant="italic">for</mml:mi>
<mml:mo stretchy="false">(</mml:mo>
<mml:mi>i</mml:mi>
<mml:mo>=</mml:mo>
<mml:mi mathvariant="italic">tile_i</mml:mi>
<mml:mo>;</mml:mo>
<mml:mi>i</mml:mi>
<mml:mo>≤</mml:mo>
<mml:mtext>min</mml:mtext>
<mml:mo stretchy="false">[</mml:mo>
<mml:mi mathvariant="italic">tile_i</mml:mi>
<mml:mo>+</mml:mo>
<mml:mi mathvariant="italic">StSizei</mml:mi>
<mml:mo>−</mml:mo>
<mml:mn>1</mml:mn>
<mml:mo>,</mml:mo>
<mml:mi>n</mml:mi>
<mml:mo>−</mml:mo>
<mml:mn>1</mml:mn>
<mml:mo stretchy="false">]</mml:mo>
<mml:mo>;</mml:mo>
<mml:mi>i</mml:mi>
<mml:mo>+</mml:mo>
<mml:mo>+</mml:mo>
<mml:mo stretchy="false">)</mml:mo>
<mml:mo stretchy="false">{</mml:mo></mml:mtd></mml:mtr>
<mml:mtr>
<mml:mtd>
<mml:mi mathvariant="italic">for</mml:mi>
<mml:mo stretchy="false">(</mml:mo>
<mml:mi>j</mml:mi>
<mml:mo>=</mml:mo>
<mml:mi mathvariant="italic">tile_j</mml:mi>
<mml:mo>;</mml:mo>
<mml:mi>j</mml:mi>
<mml:mo>≤</mml:mo>
<mml:mtext>min</mml:mtext>
<mml:mo stretchy="false">[</mml:mo>
<mml:mi mathvariant="italic">tile_j</mml:mi>
<mml:mo>+</mml:mo>
<mml:mi mathvariant="italic">StSizej</mml:mi>
<mml:mo>−</mml:mo>
<mml:mn>1</mml:mn>
<mml:mo>,</mml:mo>
<mml:mi>m</mml:mi>
<mml:mo>−</mml:mo>
<mml:mn>1</mml:mn>
<mml:mo stretchy="false">]</mml:mo>
<mml:mo>;</mml:mo>
<mml:mi>j</mml:mi>
<mml:mo>+</mml:mo>
<mml:mo>+</mml:mo>
<mml:mo stretchy="false">)</mml:mo>
<mml:mo stretchy="false">{</mml:mo></mml:mtd></mml:mtr>
<mml:mtr>
<mml:mtd>
<mml:mo> </mml:mo>
<mml:mo> </mml:mo>
<mml:mi mathvariant="italic">Loop body</mml:mi>
<mml:mo> </mml:mo>
<mml:mo> </mml:mo>
<mml:mo stretchy="false">[</mml:mo>
<mml:mi>i</mml:mi>
<mml:mo>,</mml:mo>
<mml:mo> </mml:mo>
<mml:mi>j</mml:mi>
<mml:mo stretchy="false">]</mml:mo></mml:mtd></mml:mtr>
<mml:mtr>
<mml:mtd>
<mml:mo> </mml:mo>
<mml:mo> </mml:mo>
<mml:mo stretchy="false">}</mml:mo>
<mml:mo stretchy="false">}</mml:mo>
<mml:mo stretchy="false">}</mml:mo>
<mml:mo stretchy="false">}</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula></p>
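<p>A minimal Python sketch of the tiled loop nest is given next for illustration. It assumes that the loop body is the multiply-accumulate of the <bold>y</bold> = <bold>Fu</bold> operation, that <italic>i</italic> indexes the <italic>n</italic> rows (strips of size <italic>StSizei</italic>) and that <italic>j</italic> indexes the <italic>m</italic> columns (strips of size <italic>StSizej</italic>), consistent with the bounds of Equation (10). The function names are hypothetical. The sketch only shows that the tiled traversal produces the same result as the untiled one.</p>
<preformat><![CDATA[
```python
def matvec_reference(F, u):
    """Untiled y = F u for a list-of-lists matrix F (n x m) and a vector u (m)."""
    return [sum(f_ij * u_j for f_ij, u_j in zip(row, u)) for row in F]


def matvec_tiled(F, u, st_size_i, st_size_j):
    """Tiled y = F u using a four-level loop nest in the style of Eq. (11).

    Assumption: i runs over row strips and j over column strips.
    """
    n, m = len(F), len(u)
    y = [0] * n
    for tile_i in range(0, n, st_size_i):              # strips of rows
        for tile_j in range(0, m, st_size_j):          # strips of columns
            i_hi = min(tile_i + st_size_i - 1, n - 1)  # clamp partial row strip
            j_hi = min(tile_j + st_size_j - 1, m - 1)  # clamp partial column strip
            for i in range(tile_i, i_hi + 1):
                for j in range(tile_j, j_hi + 1):
                    y[i] += F[i][j] * u[j]             # loop body [i, j]
    return y


if __name__ == "__main__":
    # Small hypothetical operands; the tile sizes do not divide n or m evenly.
    F = [[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12], [13, 14, 15]]
    u = [1, 0, -1]
    print(matvec_tiled(F, u, st_size_i=2, st_size_j=2))  # [-2, -2, -2, -2, -2]
```
]]></preformat>
<p>Each tile touches at most <italic>StSizei</italic> × <italic>StSizej</italic> entries of <bold>F</bold>, which is the property that allows a fixed-sized array to be reused tile after tile.</p>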
<p>From the analysis of the tiled parallel algorithm of <xref ref-type="disp-formula" rid="FD11">Equation (11)</xref>, one is ready to deduce the dual SSA core and the local control system architecture presented in <xref ref-type="fig" rid="f2-sensors-12-02539">Figure 2</xref>.</p></sec>
<sec>
<label>3.3.</label>
<title>Space-Time Mapping Onto Fixed-Sized SAs</title>
<p>Space-time mapping onto SAs is a technique that transforms an index-space representation into a space-time representation, in which each node of the iteration space is mapped to a certain PE and scheduled to a certain instant of time [<xref ref-type="bibr" rid="b27-sensors-12-02539">27</xref>,<xref ref-type="bibr" rid="b28-sensors-12-02539">28</xref>]. Recall that the SA is a space-time representation of the computational operations, in which the functional description defines the behavior within a node, whereas the structural description specifies the interconnections (edges and delays) between the corresponding graph nodes [<xref ref-type="bibr" rid="b27-sensors-12-02539">27</xref>]. In order to derive an SA architecture with the minimum possible number of nodes, we adopt a linear projection approach for processor assignment, <italic>i.e</italic>., the nodes of the dependence graph that lie along a certain straight line are projected onto the corresponding PEs of the SA, as specified by the assignment projection vector <bold>d</bold>. Thus, we seek a linear order reduction, in which the transformation <bold>T<sub>m</sub></bold> : <bold>G</bold><italic><sup>N</sup></italic> → <bold>Ĝ</bold><sup><italic>N</italic>−1</sup> maps the <italic>N</italic>-dimensional dependence graph (<bold>G</bold><italic><sup>N</sup></italic>) onto the (<italic>N</italic>−1)-dimensional SA (<bold>Ĝ</bold><sup><italic>N</italic>−1</sup>), where <italic>N</italic> represents the dimension of the dependence graph (see proofs in [<xref ref-type="bibr" rid="b19-sensors-12-02539">19</xref>,<xref ref-type="bibr" rid="b27-sensors-12-02539">27</xref>] and details in [<xref ref-type="bibr" rid="b30-sensors-12-02539">30</xref>]). Moldovan [<xref ref-type="bibr" rid="b31-sensors-12-02539">31</xref>] proved the mapping theory, which is stated as follows:
<disp-formula id="FD12">
<label>(12)</label>
<mml:math display="block">
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="bold">T</mml:mi></mml:mrow>
<mml:mi mathvariant="bold">m</mml:mi></mml:msub>
<mml:mo>=</mml:mo>
<mml:mrow>
<mml:mo>[</mml:mo>
<mml:mrow>
<mml:mtable>
<mml:mtr>
<mml:mtd>
<mml:mi mathvariant="bold">Π</mml:mi></mml:mtd></mml:mtr>
<mml:mtr>
<mml:mtd>
<mml:mi mathvariant="bold">Σ</mml:mi></mml:mtd></mml:mtr></mml:mtable></mml:mrow>
<mml:mo>]</mml:mo></mml:mrow>
<mml:mo>,</mml:mo></mml:mrow></mml:math></disp-formula>where <bold>Π</bold> is a (1 × <italic>N</italic>)-dimensional vector (the first row of <bold>T<sub>m</sub></bold>) which (in the partitioning terms [<xref ref-type="bibr" rid="b19-sensors-12-02539">19</xref>,<xref ref-type="bibr" rid="b29-sensors-12-02539">29</xref>]) determines the time scheduling, and the (<italic>N</italic> − 1) × <italic>N</italic> sub-matrix <bold>Σ</bold> in <xref ref-type="disp-formula" rid="FD12">Equation (12)</xref> is composed of the remaining rows of <bold>T<sub>m</sub></bold>, which determine the space processor specified by the so-called projection vector <bold>d</bold> [<xref ref-type="bibr" rid="b19-sensors-12-02539">19</xref>,<xref ref-type="bibr" rid="b31-sensors-12-02539">31</xref>]. Next, the partitioning of <xref ref-type="disp-formula" rid="FD12">Equation (12)</xref> yields the regular (<italic>N</italic> − 1)-dimensional SA specified by the mapping:
<disp-formula id="FD13">
<label>(13)</label>
<mml:math display="block">
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="bold">T</mml:mi></mml:mrow>
<mml:mi mathvariant="bold">m</mml:mi></mml:msub>
<mml:mo> </mml:mo>
<mml:mi mathvariant="bold">Φ</mml:mi>
<mml:mo>=</mml:mo>
<mml:mi mathvariant="bold">K</mml:mi>
<mml:mo>,</mml:mo></mml:mrow></mml:math></disp-formula>where <bold>K</bold> is composed of the new revised vector schedule (represented by its first row) and the inter-processor communications (represented by its remaining rows), and the matrix <bold>Φ</bold> specifies the data dependencies of the parallel representation algorithm. For a more detailed explanation of the mapping theory, see [<xref ref-type="bibr" rid="b19-sensors-12-02539">19</xref>,<xref ref-type="bibr" rid="b27-sensors-12-02539">27</xref>,<xref ref-type="bibr" rid="b28-sensors-12-02539">28</xref>]. Next, we define the following specifications for mapping the fixed-sized reconstructive RS operation, <italic>i.e</italic>., the <bold>y</bold> = <bold>Fu</bold> algorithm of <xref ref-type="disp-formula" rid="FD1">Equation (1)</xref>, onto each parallel SA core: <bold>Π</bold> = [1 1] specifies the vector schedule, <bold>d</bold> = [1 0] specifies the projection vector, and <bold>Σ</bold> = [0 1] specifies the corresponding space processor.</p>
<p>With these specifications, the transformation matrix becomes 
<inline-formula>
<mml:math>
<mml:mrow>
<mml:mi mathvariant="bold">T</mml:mi>
<mml:mo>=</mml:mo>
<mml:mrow>
<mml:mo>[</mml:mo>
<mml:mrow>
<mml:mtable>
<mml:mtr>
<mml:mtd>
<mml:mi mathvariant="bold">Π</mml:mi></mml:mtd></mml:mtr>
<mml:mtr>
<mml:mtd>
<mml:mi mathvariant="bold">Σ</mml:mi></mml:mtd></mml:mtr></mml:mtable></mml:mrow>
<mml:mo>]</mml:mo></mml:mrow>
<mml:mo>=</mml:mo>
<mml:mrow>
<mml:mo>[</mml:mo>
<mml:mrow>
<mml:mtable>
<mml:mtr>
<mml:mtd>
<mml:mn>1</mml:mn></mml:mtd>
<mml:mtd>
<mml:mn>1</mml:mn></mml:mtd></mml:mtr>
<mml:mtr>
<mml:mtd>
<mml:mn>0</mml:mn></mml:mtd>
<mml:mtd>
<mml:mn>1</mml:mn></mml:mtd></mml:mtr></mml:mtable></mml:mrow>
<mml:mo>]</mml:mo></mml:mrow></mml:mrow></mml:math></inline-formula>. Next, we specify the dependence vectors of the locally recursive algorithm: 
<inline-formula>
<mml:math>
<mml:mrow>
<mml:mi mathvariant="bold">Φ</mml:mi>
<mml:mo>=</mml:mo>
<mml:mo stretchy="false">[</mml:mo>
<mml:mtable columnalign="left">
<mml:mtr columnalign="left">
<mml:mtd columnalign="left">
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="bold">Φ</mml:mi></mml:mrow>
<mml:mi mathvariant="bold">F</mml:mi></mml:msub></mml:mrow></mml:mtd>
<mml:mtd columnalign="left">
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="bold">Φ</mml:mi></mml:mrow>
<mml:mi mathvariant="bold">u</mml:mi></mml:msub></mml:mrow></mml:mtd>
<mml:mtd columnalign="left">
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="bold">Φ</mml:mi></mml:mrow>
<mml:mi mathvariant="bold">y</mml:mi></mml:msub></mml:mrow></mml:mtd></mml:mtr></mml:mtable>
<mml:mo stretchy="false">]</mml:mo></mml:mrow></mml:math></inline-formula>, where 
<inline-formula>
<mml:math>
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="bold">Φ</mml:mi></mml:mrow>
<mml:mi mathvariant="bold">F</mml:mi></mml:msub>
<mml:mo>=</mml:mo>
<mml:mrow>
<mml:mo>[</mml:mo>
<mml:mrow>
<mml:mtable>
<mml:mtr>
<mml:mtd>
<mml:mn>0</mml:mn></mml:mtd></mml:mtr>
<mml:mtr>
<mml:mtd>
<mml:mn>1</mml:mn></mml:mtd></mml:mtr></mml:mtable></mml:mrow>
<mml:mo>]</mml:mo></mml:mrow></mml:mrow></mml:math></inline-formula>, 
<inline-formula>
<mml:math>
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="bold">Φ</mml:mi></mml:mrow>
<mml:mi mathvariant="bold">u</mml:mi></mml:msub>
<mml:mo>=</mml:mo>
<mml:mrow>
<mml:mo>[</mml:mo>
<mml:mrow>
<mml:mtable>
<mml:mtr>
<mml:mtd>
<mml:mn>1</mml:mn></mml:mtd></mml:mtr>
<mml:mtr>
<mml:mtd>
<mml:mn>0</mml:mn></mml:mtd></mml:mtr></mml:mtable></mml:mrow>
<mml:mo>]</mml:mo></mml:mrow></mml:mrow></mml:math></inline-formula> and 
<inline-formula>
<mml:math>
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="bold">Φ</mml:mi></mml:mrow>
<mml:mi mathvariant="bold">y</mml:mi></mml:msub>
<mml:mo>=</mml:mo>
<mml:mrow>
<mml:mo>[</mml:mo>
<mml:mrow>
<mml:mtable>
<mml:mtr>
<mml:mtd>
<mml:mn>0</mml:mn></mml:mtd></mml:mtr>
<mml:mtr>
<mml:mtd>
<mml:mn>1</mml:mn></mml:mtd></mml:mtr></mml:mtable></mml:mrow>
<mml:mo>]</mml:mo></mml:mrow></mml:mrow></mml:math></inline-formula> represent the dependencies of the corresponding variables in the algorithm. These specifications result in the following SA dependencies:
<disp-formula id="FD14">
<label>(14)</label>
<mml:math display="block">
<mml:mrow>
<mml:mi mathvariant="bold">T</mml:mi>
<mml:mi mathvariant="bold">Φ</mml:mi>
<mml:mo>=</mml:mo>
<mml:mi mathvariant="bold">K</mml:mi>
<mml:mi> </mml:mi>
<mml:mo>→</mml:mo>
<mml:mi> </mml:mi>
<mml:mo> </mml:mo>
<mml:munder>
<mml:munder>
<mml:mrow>
<mml:mrow>
<mml:mo>[</mml:mo>
<mml:mrow>
<mml:mtable>
<mml:mtr>
<mml:mtd>
<mml:mn>1</mml:mn></mml:mtd>
<mml:mtd>
<mml:mn>1</mml:mn></mml:mtd></mml:mtr>
<mml:mtr>
<mml:mtd>
<mml:mn>0</mml:mn></mml:mtd>
<mml:mtd>
<mml:mn>1</mml:mn></mml:mtd></mml:mtr></mml:mtable></mml:mrow>
<mml:mo>]</mml:mo></mml:mrow></mml:mrow>
<mml:mo stretchy="true">︸</mml:mo></mml:munder>
<mml:mi mathvariant="bold">T</mml:mi></mml:munder>
<mml:mo> </mml:mo>
<mml:munder>
<mml:munder>
<mml:mrow>
<mml:mrow>
<mml:mo>[</mml:mo>
<mml:mrow>
<mml:mtable>
<mml:mtr>
<mml:mtd>
<mml:mn>0</mml:mn></mml:mtd>
<mml:mtd>
<mml:mn>1</mml:mn></mml:mtd>
<mml:mtd>
<mml:mn>0</mml:mn></mml:mtd></mml:mtr>
<mml:mtr>
<mml:mtd>
<mml:mn>1</mml:mn></mml:mtd>
<mml:mtd>
<mml:mn>0</mml:mn></mml:mtd>
<mml:mtd>
<mml:mn>1</mml:mn></mml:mtd></mml:mtr></mml:mtable></mml:mrow>
<mml:mo>]</mml:mo></mml:mrow></mml:mrow>
<mml:mo stretchy="true">︸</mml:mo></mml:munder>
<mml:mi mathvariant="bold">Φ</mml:mi></mml:munder>
<mml:mo>=</mml:mo>
<mml:munder>
<mml:munder>
<mml:mrow>
<mml:mrow>
<mml:mo>[</mml:mo>
<mml:mrow>
<mml:mtable>
<mml:mtr>
<mml:mtd>
<mml:mn>1</mml:mn></mml:mtd>
<mml:mtd>
<mml:mn>1</mml:mn></mml:mtd>
<mml:mtd>
<mml:mn>1</mml:mn></mml:mtd></mml:mtr>
<mml:mtr>
<mml:mtd>
<mml:mn>1</mml:mn></mml:mtd>
<mml:mtd>
<mml:mn>0</mml:mn></mml:mtd>
<mml:mtd>
<mml:mn>1</mml:mn></mml:mtd></mml:mtr></mml:mtable></mml:mrow>
<mml:mo>]</mml:mo></mml:mrow></mml:mrow>
<mml:mo stretchy="true">︸</mml:mo></mml:munder>
<mml:mi mathvariant="bold">K</mml:mi></mml:munder>
<mml:mo>.</mml:mo></mml:mrow></mml:math></disp-formula></p>
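<p>The product in Equation (14) can be checked with a few lines of Python. The sketch below is illustrative: the helper names are hypothetical, and the admissibility test encodes commonly used textbook conditions (<bold>Σd</bold> = 0, <bold>Πd</bold> ≠ 0, and a strictly positive delay for every dependence vector) that are assumed here rather than taken from the text.</p>
<preformat><![CDATA[
```python
def matmul(a, b):
    """Product of two small integer matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def dot(u, v):
    """Inner product of two equally long integer vectors."""
    return sum(x * y for x, y in zip(u, v))


PI = [1, 1]        # schedule vector from the text
SIGMA = [0, 1]     # space (processor) row from the text
D = [1, 0]         # projection vector from the text
PHI = [[0, 1, 0],  # columns: dependence vectors of F, u and y
       [1, 0, 1]]

T = [PI, SIGMA]
K = matmul(T, PHI)


def mapping_is_admissible(pi, sigma, d, phi):
    """Assumed conditions: Sigma*d == 0, Pi*d != 0, and Pi*phi_k > 0 for every column."""
    delays = matmul([pi], phi)[0]
    return dot(sigma, d) == 0 and dot(pi, d) != 0 and all(t > 0 for t in delays)


if __name__ == "__main__":
    print(K)                                        # [[1, 1, 1], [1, 0, 1]]
    print(mapping_is_admissible(PI, SIGMA, D, PHI)) # True
```
]]></preformat>
<p>The first row of the result holds the delay assigned to each dependence, and the second row holds the displacement between PEs, so a zero entry means that the variable stays in the same PE.</p>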
<p>The number of PEs required by each coarse-grain SA of the dual SSA core architecture is <italic>n</italic>, and the required computational time is 2<italic>n</italic> – 1 clock periods. <xref ref-type="fig" rid="f3-sensors-12-02539">Figures 3</xref> and <xref ref-type="fig" rid="f4-sensors-12-02539">4</xref> conceptualize how the SA architecture is implemented at a coarse-grain level for the reconstructive processing of realistic large-scale RS scenes (e.g., 1<italic>K</italic> × 1<italic>K</italic> pixel size) by reusing the same fixed-sized SA architecture for each partitioned scene frame. At this stage, the scalability in terms of HW resources can be analyzed by varying the number of PEs of the fixed-sized SA architecture.</p></sec>
<sec>
<label>3.4.</label>
<title>Bit-Level Fixed-Sized Dual SSAs Core</title>
<p>Once the coarse-grain SAs of the selected RS reconstructive algorithm have been defined, we are ready to conceptualize and implement the bit-level fixed-sized dual SSA core. The internal structure of each fixed-sized SA contains identical PEs that are also linearly connected in a systolic fashion. The same SA-based design flow followed in the previous sub-sections (<italic>i.e</italic>., algorithmic implementation, tiling and mapping techniques) is again employed for the bit-level dual SSA core as an accelerator structure. <xref ref-type="fig" rid="f5-sensors-12-02539">Figure 5</xref> illustrates the fixed-sized bit-level SSA architecture (<italic>i.e</italic>., at a fine-grain level) of each PE of the previously conceptualized RS reconstructive operation.</p>
<p>From the analysis of <xref ref-type="fig" rid="f5-sensors-12-02539">Figure 5</xref>, one can note the improvement in hardware performance achieved with this highly pipelined architecture. The bit-level multiply-accumulate (MAC) structure of each PE operates as follows: the architecture receives 32-bit operands and generates a 64-bit product. The multiplexer in the figure performs the truncation function of the bit-level MAC operation implemented by the array of logic cells. Finally, the logic full-adder implements the rounding function for better performance.</p>
<p>The SA for performing the bit-level MAC operation of each PE of the dual SSA core employs the following specifications in the transformation defined by <xref ref-type="disp-formula" rid="FD12">Equation (12)</xref>: <bold>Π</bold> = [1 2] specifies the vector schedule, <bold>d</bold> = [1 0] specifies the projection vector and <bold>Σ</bold> = [0 1] specifies the corresponding space processor. The dependence matrix of the MAC algorithm is specified by 
<inline-formula>
<mml:math>
<mml:mrow>
<mml:mi mathvariant="bold">Φ</mml:mi>
<mml:mo>=</mml:mo>
<mml:mrow>
<mml:mo>[</mml:mo>
<mml:mrow>
<mml:mtable>
<mml:mtr>
<mml:mtd>
<mml:mn>1</mml:mn></mml:mtd>
<mml:mtd>
<mml:mn>0</mml:mn></mml:mtd>
<mml:mtd>
<mml:mrow>
<mml:mo>−</mml:mo>
<mml:mn>1</mml:mn></mml:mrow></mml:mtd></mml:mtr>
<mml:mtr>
<mml:mtd>
<mml:mn>0</mml:mn></mml:mtd>
<mml:mtd>
<mml:mn>1</mml:mn></mml:mtd>
<mml:mtd>
<mml:mn>1</mml:mn></mml:mtd></mml:mtr></mml:mtable></mml:mrow>
<mml:mo>]</mml:mo></mml:mrow></mml:mrow></mml:math></inline-formula>. For the mapping-optimized projection vector <bold>d</bold> = [1 0], these specifications yield the following SA structure:
<disp-formula id="FD15">
<label>(15)</label>
<mml:math display="block">
<mml:mrow>
<mml:mi mathvariant="bold">T</mml:mi>
<mml:mi mathvariant="bold">Φ</mml:mi>
<mml:mo>=</mml:mo>
<mml:mi mathvariant="bold">K</mml:mi>
<mml:mi> </mml:mi>
<mml:mo>→</mml:mo>
<mml:mi> </mml:mi>
<mml:mrow>
<mml:mo>[</mml:mo>
<mml:mrow>
<mml:mtable>
<mml:mtr>
<mml:mtd>
<mml:mn>1</mml:mn></mml:mtd>
<mml:mtd>
<mml:mn>2</mml:mn></mml:mtd></mml:mtr>
<mml:mtr>
<mml:mtd>
<mml:mn>0</mml:mn></mml:mtd>
<mml:mtd>
<mml:mn>1</mml:mn></mml:mtd></mml:mtr></mml:mtable></mml:mrow>
<mml:mo>]</mml:mo></mml:mrow>
<mml:mo> </mml:mo>
<mml:mrow>
<mml:mo>[</mml:mo>
<mml:mrow>
<mml:mtable>
<mml:mtr>
<mml:mtd>
<mml:mn>1</mml:mn></mml:mtd>
<mml:mtd>
<mml:mn>0</mml:mn></mml:mtd>
<mml:mtd>
<mml:mrow>
<mml:mo>−</mml:mo>
<mml:mn>1</mml:mn></mml:mrow></mml:mtd></mml:mtr>
<mml:mtr>
<mml:mtd>
<mml:mn>0</mml:mn></mml:mtd>
<mml:mtd>
<mml:mn>1</mml:mn></mml:mtd>
<mml:mtd>
<mml:mn>1</mml:mn></mml:mtd></mml:mtr></mml:mtable></mml:mrow>
<mml:mo>]</mml:mo></mml:mrow>
<mml:mo>=</mml:mo>
<mml:mrow>
<mml:mo>[</mml:mo>
<mml:mrow>
<mml:mtable>
<mml:mtr>
<mml:mtd>
<mml:mn>1</mml:mn></mml:mtd>
<mml:mtd>
<mml:mn>2</mml:mn></mml:mtd>
<mml:mtd>
<mml:mn>1</mml:mn></mml:mtd></mml:mtr>
<mml:mtr>
<mml:mtd>
<mml:mn>0</mml:mn></mml:mtd>
<mml:mtd>
<mml:mn>1</mml:mn></mml:mtd>
<mml:mtd>
<mml:mn>1</mml:mn></mml:mtd></mml:mtr></mml:mtable></mml:mrow>
<mml:mo>]</mml:mo></mml:mrow>
<mml:mo>.</mml:mo></mml:mrow></mml:math></disp-formula></p>
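<p>To see what the schedule <bold>Π</bold> = [1 2] implies, the following Python sketch enumerates a square index space and counts the PEs and time steps produced by a given mapping. The helper name <monospace>schedule_extent</monospace> and the assumption that the bit-level dependence graph is a <italic>ρ</italic> × <italic>ρ</italic> grid are introduced here for illustration only.</p>
<preformat><![CDATA[
```python
def schedule_extent(pi, sigma, size):
    """Count distinct PEs and the number of time steps for a size x size index space.

    pi, sigma -- schedule row and space row of the mapping matrix T
    size      -- extent of both loop indices (rho for the bit-level MAC)
    """
    times, pes = set(), set()
    for i in range(size):
        for j in range(size):
            times.add(pi[0] * i + pi[1] * j)      # firing time of node (i, j)
            pes.add(sigma[0] * i + sigma[1] * j)  # PE that executes node (i, j)
    return len(pes), max(times) - min(times) + 1


if __name__ == "__main__":
    # Bit-level MAC mapping with rho = 32 (values of Pi and Sigma from the text).
    print(schedule_extent([1, 2], [0, 1], 32))  # (32, 94)
    # Coarse-grain mapping of Section 3.3 with a hypothetical n = 64.
    print(schedule_extent([1, 1], [0, 1], 64))  # (64, 127)
```
]]></preformat>
<p>Under these assumptions the bit-level schedule spans 3<italic>ρ</italic> − 2 time steps on <italic>ρ</italic> PEs, and the same enumeration with <bold>Π</bold> = [1 1] gives 2<italic>n</italic> − 1 steps on <italic>n</italic> PEs, in agreement with the figures quoted for the coarse-grain array.</p>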
<p>This bit-level SA-based MAC architecture requires an array of <italic>ρ</italic> bit-level multiply-accumulate operations and takes 3<italic>ρ</italic> − 2 clock periods. In this study, we consider 32-bit operands (<italic>i.e</italic>., <italic>ρ</italic> = 32). The performance analysis of this dual SSA core architecture is presented in Section 4.</p></sec></sec>
<sec sec-type="results">
<label>4.</label>
<title>Implementation Results</title>
<p>In this section, the results of the hardware-level implementation of the reconstructive complex RS functions with a high-speed fixed-sized dual SSA core accelerator are reported. The addressed architecture drastically reduces the computational load of the enhancement/reconstruction of real-world geospatial images acquired with different fractional multisensory SAR systems. In order to demonstrate the best area-time trade-off of the digital implementation and the high accuracy of the proposed RS hardware accelerator, the authors have conducted test-case scenarios on real-world RS images characterized by a point spread function (PSF) of Gaussian “bell” shape in both directions of the 2-D scene (in particular, of 16-pixel width at 0.5 of its maximum for the 1<italic>k</italic>-by-1<italic>k</italic> BMP pixel-formatted scene) with the selected FPGA target platform.</p>
<sec sec-type="methods">
<label>4.1.</label>
<title>Architecture Performance Analysis</title>
<p>The comparative performance analysis of the HW-level implementation of the dual SSA core architecture is presented in this sub-section. This HW-level performance analysis is focused on identifying the best area-time trade-off. The Xilinx XST tool of the Integrated Software Environment (ISE™) WebPACK™ was used for the synthesis of the proposed architecture. A clock frequency of 100 MHz is the timing constraint selected for the synthesis procedure. The following parameters were considered in the synthesis: the order of each fixed-sized SSA is <italic>m</italic> = 64 and the data-output sample word length is 32 bits. <xref ref-type="table" rid="t1-sensors-12-02539">Table 1</xref> reports the synthesis results, evaluated by different metrics that are indicative of the efficiency of the proposed reconfigurable architecture with respect to the selected FPGA targets Xilinx Virtex-5 XC5VFX130T and Virtex-4 XC4VSX35.</p>
<p>From the analysis of <xref ref-type="table" rid="t1-sensors-12-02539">Table 1</xref>, one relevant implementation result is the high-speed and high-throughput performance: the proposed architecture is able to run at up to 920.93 MHz.</p>
<p>Next, the scalability analysis in terms of HW resources is presented for the relevant dual SSA core architecture in <xref ref-type="fig" rid="f6-sensors-12-02539">Figure 6</xref>.</p>
<p>In this analysis, the number of precision bits and the number of processing elements (PEs) are varied in order to estimate the HW resources. The results reveal the area resource utilization of the dual SSA-based architecture in the FPGA target platform.</p>
<p>Alternative implementations for RS applications that use specialized HW architectures (<italic>i.e</italic>., SA-based) are presented in [<xref ref-type="bibr" rid="b16-sensors-12-02539">16</xref>,<xref ref-type="bibr" rid="b17-sensors-12-02539">17</xref>,<xref ref-type="bibr" rid="b32-sensors-12-02539">32</xref>,<xref ref-type="bibr" rid="b33-sensors-12-02539">33</xref>]. For example, in [<xref ref-type="bibr" rid="b32-sensors-12-02539">32</xref>], a digital custom space-based FPGA co-processor for high-performance space computing is presented. However, the design of such co-processors does not consider an MPSoC scheme for a parallel real-time application. Another approach for high-speed computational implementation of reconstructive RS image processing, based on the use of clusters of PCs, was presented by Yang <italic>et al</italic>. in [<xref ref-type="bibr" rid="b33-sensors-12-02539">33</xref>], in which the cluster NSPO Parallel TestBed was implemented for performing parallel radiometric and geometrical corrections of large-scale 3,600 × 2,944-pixel RS images. The reconstructive image processing was conducted using a PC cluster composed of three PCs, each with a Pentium-III 550 MHz processor and 128 MB of RAM, connected by a 100 Mbps Fast-Ethernet LAN. The processing time achieved with this three-PC cluster was 33.3 s (near-real time for conventional RS users), while the corresponding processing performed with a single processor required 84.65 s. Once more, the authors believe that this dual SSA core is unique and completely differs from other specialized HW architectures in recent RS applications.</p>
<p>In the next sub-section, a concrete real-world geospatial test application from a multi-sensor array SAR system scenario is presented. This test evaluates the accuracy of the proposed dual SSA core when it is integrated in an MPSoC design via HW/SW co-design, and the results obtained with the enhancement/reconstruction model are discussed there.</p></sec>
<sec sec-type="methods">
<label>4.2.</label>
<title>High-Resolution Enhancement/Reconstruction of RADAR/SAR Imagery: A Test Case Study</title>
<p>In this sub-section, we present an illustrative test case study related to the HW implementation of the Weighted Constrained Least Squares (WCLS) regularization technique for the enhancement/reconstruction of RS images. This HW implementation is based on the dual SSA core proposed here, in aggregation with a MicroBlaze embedded processor and the On-chip Peripheral Bus (OPB) for transferring the data to/from the embedded processor. In the HW design, we use a 32-bit fixed-point precision, with a 9-bit integer part and a 23-bit fractional part, for the implementation of all fixed-point operations in each SSA core. Once the HW/SW co-design methodology has been employed, we are ready to establish the verification statements to evaluate the accuracy of the MPSoC system.</p>
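<p>The fixed-point format mentioned above can be illustrated with a short Python model. This is a sketch under stated assumptions and not the authors' HW description: it assumes a signed two's-complement 32-bit word in which the sign bit is counted among the 9 integer bits, saturation on overflow, and round-half-up when the wide product is reduced back to 23 fractional bits. All names are hypothetical.</p>
<preformat><![CDATA[
```python
FRAC_BITS = 23                          # fractional bits stated in the text
WORD_BITS = 32                          # total word length stated in the text
SCALE = 1 << FRAC_BITS
INT_MIN = -(1 << (WORD_BITS - 1))       # assumed signed two's-complement range
INT_MAX = (1 << (WORD_BITS - 1)) - 1


def saturate(raw):
    """Clamp an integer to the assumed signed 32-bit range."""
    return max(INT_MIN, min(INT_MAX, raw))


def to_fixed(x):
    """Quantize a float to the nearest representable fixed-point word."""
    return saturate(int(round(x * SCALE)))


def to_float(q):
    """Convert a fixed-point word back to a float."""
    return q / SCALE


def fixed_mac(acc, a, b):
    """Return acc + a*b, with the wide product rounded back to 23 fractional bits."""
    product = a * b                                   # up to 64 bits, 46 fractional bits
    rounded = (product + (1 << (FRAC_BITS - 1))) >> FRAC_BITS
    return saturate(acc + rounded)


if __name__ == "__main__":
    acc = fixed_mac(0, to_fixed(1.5), to_fixed(2.0))
    print(to_float(acc))                              # 3.0
    print(abs(to_float(to_fixed(0.1)) - 0.1) <= 2 ** -24)  # True
```
]]></preformat>
<p>With 23 fractional bits the quantization step is 2<sup>−23</sup>, so the rounding error of a single conversion is at most 2<sup>−24</sup>.</p>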
<p>In the HW implementation, a large-scale (1<italic>K</italic>-by-1<italic>K</italic>) pixel-format RS image taken from real-world high-resolution terrain SAR imagery was employed. The quantitative evaluation of the RS reconstruction performance was carried out using the quality metric defined as the improvement in the output signal-to-noise ratio (IOSNR) [<xref ref-type="bibr" rid="b34-sensors-12-02539">34</xref>]. In this evaluation, the signal formation operator of all RS images is factorized along two axes in the image plane: the azimuth (horizontal axis, <italic>x</italic>) and the range (vertical axis, <italic>y</italic>). Following the common practically motivated technical considerations [<xref ref-type="bibr" rid="b3-sensors-12-02539">3</xref>,<xref ref-type="bibr" rid="b4-sensors-12-02539">4</xref>,<xref ref-type="bibr" rid="b15-sensors-12-02539">15</xref>–<xref ref-type="bibr" rid="b20-sensors-12-02539">20</xref>], we modeled the Gaussian shape in the SAR range PSF <italic>Ψ<sub>r</sub></italic>(<italic>y</italic>) in the range direction <italic>y</italic>, and the side-looking SAR azimuth PSF <italic>Ψ<sub>a</sub></italic>(<italic>x</italic>) in the cross-range direction <italic>x</italic> at the zero crossing level for the simulated SAR system with fractionally synthesized aperture.</p>
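<p>For orientation, a commonly used form of an IOSNR-type metric is sketched below in Python. The exact definition used in this study is the one given in [<xref ref-type="bibr" rid="b34-sensors-12-02539">34</xref>] and may differ. The form assumed here is the ratio, in decibels, of the squared error of the degraded image to the squared error of the reconstructed image, both measured against the original scene. The function name and the flattened-list image representation are assumptions.</p>
<preformat><![CDATA[
```python
import math


def iosnr_db(reference, degraded, reconstructed):
    """Assumed IOSNR form: 10*log10(sum (degraded-ref)^2 / sum (reconstructed-ref)^2).

    All three arguments are equally long flat sequences of pixel values.
    """
    err_in = sum((d - r) ** 2 for d, r in zip(degraded, reference))
    err_out = sum((p - r) ** 2 for p, r in zip(reconstructed, reference))
    if err_in == 0 or err_out == 0:
        raise ValueError("IOSNR is undefined when either error energy is zero")
    return 10.0 * math.log10(err_in / err_out)


if __name__ == "__main__":
    # Toy four-pixel example with made-up values, not data from the paper.
    reference = [0, 0, 0, 0]
    degraded = [10, 0, 0, 0]
    reconstructed = [1, 0, 0, 0]
    print(iosnr_db(reference, degraded, reconstructed))  # 20.0
```
]]></preformat>
<p>Under this form, a positive value means that the reconstruction is closer to the original scene than the degraded image is.</p>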
<p>The quantitative measures of the image enhancement/reconstruction performance gains achieved with the particular employed WCLS technique, evaluated via the IOSNR metric, are reported in <xref ref-type="table" rid="t2-sensors-12-02539">Table 2</xref> with two different real-world high-resolution scene images.</p>
<p>Next, the qualitative results are presented in <xref ref-type="fig" rid="f7-sensors-12-02539">Figures 7</xref> and <xref ref-type="fig" rid="f8-sensors-12-02539">8</xref> for two different real-world high-resolution scenes. <xref ref-type="fig" rid="f7-sensors-12-02539">Figures 7(a)</xref> and <xref ref-type="fig" rid="f8-sensors-12-02539">8(a)</xref> show the original test scene images. <xref ref-type="fig" rid="f7-sensors-12-02539">Figures 7(b)</xref> and <xref ref-type="fig" rid="f8-sensors-12-02539">8(b)</xref> present the noisy low-resolution (degraded) scene images formed with the conventional MSF algorithm. <xref ref-type="fig" rid="f7-sensors-12-02539">Figures 7(c)</xref> and <xref ref-type="fig" rid="f8-sensors-12-02539">8(c)</xref> present the scene images reconstructed with the CLS-regularized algorithm. <xref ref-type="fig" rid="f7-sensors-12-02539">Figures 7(d)</xref> and <xref ref-type="fig" rid="f8-sensors-12-02539">8(d)</xref> present the scene images reconstructed with the WCLS-regularized algorithm.</p>
<p>From the analysis of the qualitative and quantitative implementation results reported in <xref ref-type="fig" rid="f7-sensors-12-02539">Figures 7</xref> and <xref ref-type="fig" rid="f8-sensors-12-02539">8</xref>, and <xref ref-type="table" rid="t2-sensors-12-02539">Table 2</xref>, one may deduce that the dual SSA core was efficiently integrated in the MPSoC embedded system via the HW/SW co-design method. Additionally, the WCLS implementation results outperform the robust non-adaptive CLS in all simulated scenarios.</p>
<p>Finally, in <xref ref-type="table" rid="t3-sensors-12-02539">Table 3</xref>, we report the processing times required for implementing the WCLS image reconstruction algorithms using the developed dual SSA core in a MPSoC embedded system.</p>
<p>In the first case in <xref ref-type="table" rid="t3-sensors-12-02539">Table 3</xref>, the WCLS algorithm was implemented in C++ software on a personal computer (PC) running at 3 GHz with an AMD Athlon(tm) 64 dual-core processor and 2 GB of RAM. In the second case, the same WCLS-related algorithm was implemented using the efficient architecture approach proposed here, with the specialized dual SSA core employed in the Xilinx FPGA Virtex-5 XC5VFX130T.</p>
<p>The implementation of this high-speed architecture helps to drastically reduce the overall processing time. In particular, the implementation of the WCLS algorithm with the proposed HW-specialized architecture takes only 0.81 s for the large-scale RS image reconstruction, in contrast to the 12.6 s required by the C++ reference implementation. Thus, the processing time of the proposed dual SSA core-oriented architecture is approximately 16 times shorter than the processing time achieved with the conventional C++ PC-based implementation.</p>
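<p>The quoted ratio follows directly from the two processing times. The small Python helper below (a hypothetical name, shown only to make the arithmetic explicit) reproduces it.</p>
<preformat><![CDATA[
```python
def speedup(t_reference_s, t_accelerated_s):
    """Ratio of the reference time to the accelerated time (both in seconds)."""
    if t_accelerated_s <= 0:
        raise ValueError("accelerated time must be positive")
    return t_reference_s / t_accelerated_s


if __name__ == "__main__":
    # Times quoted in the text: 12.6 s (C++ on the PC) and 0.81 s (dual SSA core).
    print(round(speedup(12.6, 0.81), 2))  # 15.56
```
]]></preformat>
<p>The same helper applied to the three-PC cluster figures quoted in Section 4.1 (84.65 s against 33.3 s) gives a ratio of about 2.5.</p>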
<p>In this regard, the emergence of specialized hardware devices such as the dual SSA core in FPGAs represents a new paradigm for developing real-time systems for remote sensing data processing. The increasing computational demands of remote sensing applications can now benefit from these compact hardware components, taking advantage of the small size and relatively low cost of these units as compared to clusters or networks of computers. These aspects are of great importance in other areas such as hyperspectral imaging for Earth observation and remote sensing missions that require an extremely large number of spectral bands and high spatial resolution.</p></sec></sec>
<sec sec-type="conclusions">
<label>5.</label>
<title>Conclusions</title>
<p>In this study, a high-speed dual SSA core that accelerates complex regularization operations for the real-time enhancement/reconstruction of large-scale RS imaging of radar/SAR sensor systems is presented. In addition, a design methodology for the real-time implementation of such specialized arrays of processors using a high performance embedded computing (HPEC) architecture was developed.</p>
<p>The dual SSA core was evaluated as follows: the architecture was aggregated with a MicroBlaze embedded processor in an MPSoC structure via the HW/SW co-design paradigm. The WCLS regularization method was algorithmically adapted (using parallel computing techniques) and implemented in a real-time computational mode (the ‘real-time’ being understood in the context of conventional RS users). The performance analysis and the qualitative and quantitative results reveal that the dual SSA core, used as an accelerator unit, drastically increases the throughput and reduces the processing time of large-scale real-time image processing while performing the reconstruction of real-world hyperspectral RS imagery. In addition, the authors believe that FPGA/DSP-based systems in aggregation with novel bit-level super-systolic architectures offer enormous computation potential in RS systems for newer geospatial applications.</p></sec></body>
<back>
<ack>
<p>The study was supported by Consejo Nacional de Ciencia y Tecnología (México) under the grant CB-2010-01-158136.</p></ack>
<ref-list>
<title>References</title>
<ref id="b1-sensors-12-02539"><label>1.</label><citation citation-type="book"><person-group person-group-type="author"><name><surname>Martínez</surname><given-names>D.R.</given-names></name><name><surname>Bond</surname><given-names>R.A.</given-names></name><name><surname>Vai</surname><given-names>M.M.</given-names></name></person-group><source>High Performance Embedded Computing Handbook: A Systems Perspective</source><publisher-name>CRC Press</publisher-name><publisher-loc>Boca Raton, FL, USA</publisher-loc><year>2008</year></citation></ref>
<ref id="b2-sensors-12-02539"><label>2.</label><citation citation-type="book"><person-group person-group-type="author"><name><surname>Levesque</surname><given-names>J.</given-names></name><name><surname>Wagenbreth</surname><given-names>G.</given-names></name></person-group><source>High Performance Computing Programming and Applications</source><publisher-name>CRC Press</publisher-name><publisher-loc>Boca Raton, FL, USA</publisher-loc><year>2011</year></citation></ref>
<ref id="b3-sensors-12-02539"><label>3.</label><citation citation-type="book"><person-group person-group-type="author"><name><surname>Henderson</surname><given-names>F.M.</given-names></name><name><surname>Lewis</surname><given-names>A.V.</given-names></name></person-group><source>Principles and Applications of Imaging Radar, Manual of Remote Sensing</source><edition>3rd ed</edition><publisher-name>Wiley</publisher-name><publisher-loc>New York, NY, USA</publisher-loc><year>1998</year></citation></ref>
<ref id="b4-sensors-12-02539"><label>4.</label><citation citation-type="book"><person-group person-group-type="author"><name><surname>Barrett</surname><given-names>H.H.</given-names></name><name><surname>Myers</surname><given-names>K.J.</given-names></name></person-group><source>Foundations of Image Science</source><publisher-name>Wiley</publisher-name><publisher-loc>New York, NY, USA</publisher-loc><year>2004</year></citation></ref>
<ref id="b5-sensors-12-02539"><label>5.</label><citation citation-type="book"><person-group person-group-type="author"><name><surname>Chang</surname><given-names>C.-I.</given-names></name></person-group><source>Hyperspectral Imaging: Techniques for Spectral Detection and Classification</source><publisher-name>Kluwer Academic/Plenum</publisher-name><publisher-loc>New York, NY, USA</publisher-loc><year>2003</year></citation></ref>
<ref id="b6-sensors-12-02539"><label>6.</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Shkvarko</surname><given-names>Y.V.</given-names></name></person-group><article-title>Unifying regularization and Bayesian estimation methods for enhanced imaging with remotely sensed data—Part I: Theory</article-title><source>IEEE Trans. Geosci. Remote Sens</source><year>2004</year><volume>42</volume><fpage>923</fpage><lpage>931</lpage><pub-id pub-id-type="doi">10.1109/TGRS.2003.823281</pub-id></citation></ref>
<ref id="b7-sensors-12-02539"><label>7.</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Shkvarko</surname><given-names>Y.V.</given-names></name></person-group><article-title>Unifying regularization and Bayesian estimation methods for enhanced imaging with remotely sensed data—Part II: Implementation and performance issues</article-title><source>IEEE Trans. Geosci. Remote Sens</source><year>2004</year><volume>42</volume><fpage>932</fpage><lpage>940</lpage><pub-id pub-id-type="doi">10.1109/TGRS.2003.823279</pub-id></citation></ref>
<ref id="b8-sensors-12-02539"><label>8.</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Plaza</surname><given-names>A.</given-names></name><name><surname>Valencia</surname><given-names>D.</given-names></name><name><surname>Plaza</surname><given-names>J.</given-names></name><name><surname>Martinez</surname><given-names>P.</given-names></name></person-group><article-title>Commodity cluster-based parallel processing of hyperspectral imagery</article-title><source>J. Parallel Distrib. Comp</source><year>2006</year><volume>66</volume><fpage>345</fpage><lpage>358</lpage><pub-id pub-id-type="doi">10.1016/j.jpdc.2005.10.001</pub-id></citation></ref>
<ref id="b9-sensors-12-02539"><label>9.</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Wei</surname><given-names>S.-C.</given-names></name><name><surname>Huang</surname><given-names>B.</given-names></name></person-group><article-title>GPU acceleration of predictive partitioned vector quantization for ultraspectral sounder data compression</article-title><source>IEEE J. Sel. Topics Appl. Earth Observ. Remote Sens. (JSTARS)</source><year>2011</year><volume>4</volume><fpage>677</fpage><lpage>682</lpage><pub-id pub-id-type="doi">10.1109/JSTARS.2011.2132117</pub-id></citation></ref>
<ref id="b10-sensors-12-02539"><label>10.</label><citation citation-type="confproc"><person-group person-group-type="author"><name><surname>Govett</surname><given-names>M.W.</given-names></name><name><surname>Middlecoff</surname><given-names>J.</given-names></name><name><surname>Henderson</surname><given-names>T.</given-names></name></person-group><article-title>Running the NIM next-generation weather model on GPUs</article-title><conf-name>Proceedings of the 10th IEEE/ACM International Conference Cluster, Cloud and Grid Computing (CCGrid)</conf-name><conf-loc>Melbourne, Australia</conf-loc><conf-date>17–20 May 2010</conf-date><volume>1</volume><fpage>792</fpage><lpage>796</lpage></citation></ref>
<ref id="b11-sensors-12-02539"><label>11.</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Aanaes</surname><given-names>H.</given-names></name><name><surname>Sveinsson</surname><given-names>J.R.</given-names></name><name><surname>Nielsen</surname><given-names>A.A.</given-names></name><name><surname>Bovith</surname><given-names>T.</given-names></name><name><surname>Benediktsson</surname><given-names>J.A.</given-names></name></person-group><article-title>Integration of spatial spectral information for resolution enhancement in hyperspectral images</article-title><source>IEEE Trans. Geosci. Remote Sens</source><year>2008</year><volume>46</volume><fpage>1336</fpage><lpage>1346</lpage><pub-id pub-id-type="doi">10.1109/TGRS.2008.916475</pub-id></citation></ref>
<ref id="b12-sensors-12-02539"><label>12.</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Shkvarko</surname><given-names>Y.V.</given-names></name></person-group><article-title>Unifying experiment design and convex regularization techniques for enhanced imaging with uncertain remote sensing data––Part I: Theory</article-title><source>IEEE Trans. Geosci. Remote Sens</source><year>2010</year><volume>48</volume><fpage>82</fpage><lpage>95</lpage><pub-id pub-id-type="doi">10.1109/TGRS.2009.2027695</pub-id></citation></ref>
<ref id="b13-sensors-12-02539"><label>13.</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Shkvarko</surname><given-names>Y.V.</given-names></name></person-group><article-title>Unifying experiment design and convex regularization techniques for enhanced imaging with uncertain remote sensing data—Part II: Adaptive implementation and performance issues</article-title><source>IEEE Trans. Geosci. Remote Sens</source><year>2010</year><volume>48</volume><fpage>96</fpage><lpage>111</lpage><pub-id pub-id-type="doi">10.1109/TGRS.2009.2027696</pub-id></citation></ref>
<ref id="b14-sensors-12-02539"><label>14.</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>De Maio</surname><given-names>A.</given-names></name><name><surname>Farina</surname><given-names>A.</given-names></name><name><surname>Foglia</surname><given-names>G.</given-names></name></person-group><article-title>Knowledge-aided Bayesian radar detectors and their application to live data</article-title><source>IEEE Trans. Aerosp. Electr. Syst</source><year>2010</year><volume>46</volume><fpage>170</fpage><lpage>183</lpage><pub-id pub-id-type="doi">10.1109/TAES.2010.5417154</pub-id></citation></ref>
<ref id="b15-sensors-12-02539"><label>15.</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Shkvarko</surname><given-names>Y.</given-names></name><name><surname>Perez-Meana</surname><given-names>H.</given-names></name><name><surname>Castillo-Atoche</surname><given-names>A.</given-names></name></person-group><article-title>Enhanced radar imaging in uncertain environment: A descriptive experiment design regularization approach</article-title><source>Int. J. Navig. Obs</source><year>2008</year><volume>2008</volume><fpage>1</fpage><lpage>11</lpage></citation></ref>
<ref id="b16-sensors-12-02539"><label>16.</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Castillo Atoche</surname><given-names>A.</given-names></name><name><surname>Torres</surname><given-names>D.</given-names></name><name><surname>Shkvarko</surname><given-names>Y.V.</given-names></name></person-group><article-title>Experiment design regularization-based hardware/software co-design for real-time enhanced imaging in uncertain remote sensing environment</article-title><source>EURASIP J. Adv. Signal Process</source><year>2010</year><volume>2010</volume><fpage>1</fpage><lpage>21</lpage></citation></ref>
<ref id="b17-sensors-12-02539"><label>17.</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Castillo Atoche</surname><given-names>A.</given-names></name><name><surname>Shkvarko</surname><given-names>Y.V.</given-names></name><name><surname>Torres</surname><given-names>D.</given-names></name><name><surname>Perez</surname><given-names>H.M.</given-names></name></person-group><article-title>Convex regularization-based hardware/software co-design for real-time enhancement of remote sensing imagery</article-title><source>J. Real-Time Image Process</source><year>2009</year><volume>4</volume><fpage>261</fpage><lpage>272</lpage><pub-id pub-id-type="doi">10.1007/s11554-009-0115-3</pub-id></citation></ref>
<ref id="b18-sensors-12-02539"><label>18.</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Shkvarko</surname><given-names>Y.V.</given-names></name><name><surname>Castillo Atoche</surname><given-names>A.</given-names></name><name><surname>Torres</surname><given-names>D.</given-names></name></person-group><article-title>Near real time enhancement of geospatial imagery via systolic implementation of neural network-adapted convex regularization techniques</article-title><source>Pattern Recognit. Lett</source><year>2011</year><volume>32</volume><fpage>2197</fpage><lpage>2205</lpage><pub-id pub-id-type="doi">10.1016/j.patrec.2011.05.018</pub-id></citation></ref>
<ref id="b19-sensors-12-02539"><label>19.</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Castillo Atoche</surname><given-names>A.</given-names></name><name><surname>Torres</surname><given-names>D.</given-names></name><name><surname>Shkvarko</surname><given-names>Y.V.</given-names></name></person-group><article-title>Towards real time implementation of reconstructive signal processing algorithms using systolic arrays coprocessors</article-title><source>J. Syst. Archit. (JSA)</source><year>2010</year><volume>56</volume><fpage>327</fpage><lpage>339</lpage><pub-id pub-id-type="doi">10.1016/j.sysarc.2010.05.004</pub-id></citation></ref>
<ref id="b20-sensors-12-02539"><label>20.</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Shkvarko</surname><given-names>Y.V.</given-names></name><name><surname>Shmaliy</surname><given-names>Y.S.</given-names></name><name><surname>Jaime-Rivas</surname><given-names>R.</given-names></name><name><surname>Torres-Cisneros</surname><given-names>M.</given-names></name></person-group><article-title>System fusion in passive sensing using a modified Hopfield network</article-title><source>J. Frankl. Inst</source><year>2000</year><volume>338</volume><fpage>405</fpage><lpage>427</lpage></citation></ref>
<ref id="b21-sensors-12-02539"><label>21.</label><citation citation-type="book"><person-group person-group-type="author"><name><surname>Castillo Atoche</surname><given-names>A.</given-names></name><name><surname>Estrada Lopez</surname><given-names>J.</given-names></name><name><surname>Pedro Muñoz</surname><given-names>P.</given-names></name><name><surname>Soto Aguilar</surname><given-names>S.</given-names></name></person-group><source>High-Speed VLSI Architectures Based on Massively Parallel Processor Arrays for Real Time Remote Sensing Applications</source><publisher-name>Intech</publisher-name><publisher-loc>Rijeka, Croatia</publisher-loc><year>2011</year><fpage>133</fpage><lpage>152</lpage></citation></ref>
<ref id="b22-sensors-12-02539"><label>22.</label><citation citation-type="web"><article-title>Fixed-Point Toolbox™ User’s Guide. MATLAB</article-title><comment>Available online: <ext-link xlink:href="http://www.mathworks.com/" ext-link-type="uri">http://www.mathworks.com/</ext-link> (accessed on 3 December 2011).</comment></citation></ref>
<ref id="b23-sensors-12-02539"><label>23.</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>López-Vallejo</surname><given-names>M.</given-names></name><name><surname>López</surname><given-names>J.C.</given-names></name></person-group><article-title>On the hardware-software partitioning problem: System modeling and partitioning techniques</article-title><source>ACM Trans. Des. Autom. Electron. Syst</source><year>2003</year><volume>8</volume><fpage>269</fpage><lpage>297</lpage><pub-id pub-id-type="doi">10.1145/785411.785412</pub-id></citation></ref>
<ref id="b24-sensors-12-02539"><label>24.</label><citation citation-type="confproc"><person-group person-group-type="author"><name><surname>Jin</surname><given-names>W.</given-names></name><name><surname>Zhang</surname><given-names>C.N.</given-names></name><name><surname>Li</surname><given-names>H.</given-names></name></person-group><article-title>Mapping multiple algorithms into a reconfigurable systolic array</article-title><conf-name>Proceedings of Canadian Conference on Electrical and Computer Engineering (CCECE 2008)</conf-name><conf-loc>Niagara Falls, ON, Canada</conf-loc><conf-date>4–7 May 2008</conf-date><fpage>1187</fpage><lpage>1192</lpage></citation></ref>
<ref id="b25-sensors-12-02539"><label>25.</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Marquardt</surname><given-names>A.</given-names></name><name><surname>Betz</surname><given-names>V.</given-names></name><name><surname>Rose</surname><given-names>J.</given-names></name></person-group><article-title>Speed and area tradeoffs in cluster-based FPGA architectures</article-title><source>IEEE Trans. Very Large Scale Integr. Syst</source><year>2000</year><volume>8</volume><fpage>84</fpage><lpage>93</lpage><pub-id pub-id-type="doi">10.1109/92.820764</pub-id></citation></ref>
<ref id="b26-sensors-12-02539"><label>26.</label><citation citation-type="book"><person-group person-group-type="author"><name><surname>Hauck</surname><given-names>S.</given-names></name><name><surname>DeHon</surname><given-names>A.</given-names></name></person-group><source>Reconfigurable Computing: The Theory and Practice of FPGA-Based Computation</source><publisher-name>Morgan Kaufmann Publishers</publisher-name><publisher-loc>Burlington, MA, USA</publisher-loc><year>2008</year></citation></ref>
<ref id="b27-sensors-12-02539"><label>27.</label><citation citation-type="book"><person-group person-group-type="author"><name><surname>Kung</surname><given-names>S.Y.</given-names></name></person-group><source>VLSI Array Processors</source><publisher-name>Prentice Hall</publisher-name><publisher-loc>Upper Saddle River, NJ, USA</publisher-loc><year>1988</year></citation></ref>
<ref id="b28-sensors-12-02539"><label>28.</label><citation citation-type="book"><person-group person-group-type="author"><name><surname>Parhi</surname><given-names>K.K.</given-names></name></person-group><source>VLSI Digital Signal Processing Systems</source><publisher-name>John Wiley &amp; Sons</publisher-name><publisher-loc>Hoboken, NJ, USA</publisher-loc><year>1999</year></citation></ref>
<ref id="b29-sensors-12-02539"><label>29.</label><citation citation-type="confproc"><person-group person-group-type="author"><name><surname>Dutta</surname><given-names>H.</given-names></name><name><surname>Hannig</surname><given-names>F.</given-names></name><name><surname>Teich</surname><given-names>J.</given-names></name></person-group><article-title>Controller synthesis for mapping partitioned programs on array architectures</article-title><conf-name>Proceedings of the 19th International Conference on Architecture of Computing Systems—ARCS ’2006</conf-name><conf-loc>Frankfurt/Main, Germany</conf-loc><conf-date>13–16 March 2006</conf-date></citation></ref>
<ref id="b30-sensors-12-02539"><label>30.</label><citation citation-type="book"><person-group person-group-type="author"><name><surname>Banerjee</surname><given-names>U.</given-names></name></person-group><source>Loop Transformations for Restructuring Compilers: The Foundations</source><publisher-name>Kluwer Academic Publishers</publisher-name><publisher-loc>Dordrecht, The Netherlands</publisher-loc><year>1993</year></citation></ref>
<ref id="b31-sensors-12-02539"><label>31.</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Moldovan</surname><given-names>D.I.</given-names></name></person-group><article-title>On the design of algorithms for VLSI systolic arrays</article-title><source>Proc. IEEE</source><year>1983</year><volume>71</volume><fpage>113</fpage><lpage>120</lpage><pub-id pub-id-type="doi">10.1109/PROC.1983.12532</pub-id></citation></ref>
<ref id="b32-sensors-12-02539"><label>32.</label><citation citation-type="confproc"><person-group person-group-type="author"><name><surname>Greco</surname><given-names>J.</given-names></name><name><surname>Cieslewski</surname><given-names>G.</given-names></name><name><surname>Jacobs</surname><given-names>A.</given-names></name><name><surname>Troxel</surname><given-names>I.A.</given-names></name><name><surname>George</surname><given-names>A.D.</given-names></name></person-group><article-title>Hardware/software interface for high-performance space computing with FPGA coprocessors</article-title><conf-name>Proceedings of IEEE Aerospace Conference (AECON ’06)</conf-name><conf-loc>Big Sky, MT, USA</conf-loc><conf-date>July 2006</conf-date></citation></ref>
<ref id="b33-sensors-12-02539"><label>33.</label><citation citation-type="confproc"><person-group person-group-type="author"><name><surname>Yang</surname><given-names>C.T.</given-names></name><name><surname>Chang</surname><given-names>C.L.</given-names></name><name><surname>Hung</surname><given-names>C.C.</given-names></name><name><surname>Wu</surname><given-names>F.</given-names></name></person-group><article-title>Using a Beowulf cluster for a remote sensing application</article-title><conf-name>Proceedings of the 22nd Asian Conference on Remote Sensing</conf-name><conf-loc>Singapore</conf-loc><conf-date>5–9 November 2001</conf-date></citation></ref>
<ref id="b34-sensors-12-02539"><label>34.</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Ponomaryov</surname><given-names>V.I.</given-names></name></person-group><article-title>Real-time 2D–3D filtering using order statistics based algorithms</article-title><source>J. Real-Time Image Process</source><year>2007</year><volume>1</volume><fpage>173</fpage><lpage>194</lpage><pub-id pub-id-type="doi">10.1007/s11554-007-0021-5</pub-id></citation></ref></ref-list>
<sec sec-type="display-objects">
<title>Figures and Tables</title>
<fig id="f1-sensors-12-02539" position="float">
<label>Figure 1.</label>
<caption>
<p>MPSoC platform of RS algorithms via the HW/SW co-design paradigm.</p></caption>
<graphic xlink:href="sensors-12-02539f1.gif"/></fig>
<fig id="f2-sensors-12-02539" position="float">
<label>Figure 2.</label>
<caption>
<p>MPSoC platform of RS algorithms via the HW/SW co-design paradigm.</p></caption>
<graphic xlink:href="sensors-12-02539f2.gif"/></fig>
<fig id="f3-sensors-12-02539" position="float">
<label>Figure 3.</label>
<caption>
<p>Dual SSA core of the RS-related estimator.</p></caption>
<graphic xlink:href="sensors-12-02539f3.gif"/></fig>
<fig id="f4-sensors-12-02539" position="float">
<label>Figure 4.</label>
<caption>
<p>MAC operation of each PE.</p></caption>
<graphic xlink:href="sensors-12-02539f4.gif"/></fig>
<fig id="f5-sensors-12-02539" position="float">
<label>Figure 5.</label>
<caption>
<p>Bit-level SSA of the MAC structure.</p></caption>
<graphic xlink:href="sensors-12-02539f5.gif"/></fig>
<fig id="f6-sensors-12-02539" position="float">
<label>Figure 6.</label>
<caption>
<p>HW-resource scalability analysis: (<bold>a</bold>) varying the number of PEs for Virtex-4; (<bold>b</bold>) varying the number of PEs for Virtex-5; (<bold>c</bold>) varying the bit precision for Virtex-4; and (<bold>d</bold>) varying the bit precision for Virtex-5.</p></caption>
<graphic xlink:href="sensors-12-02539f6.gif"/></fig>
<fig id="f7-sensors-12-02539" position="float">
<label>Figure 7.</label>
<caption>
<p>Implementation results for the first observation scenario (SNR μ = 10 dB): (<bold>a</bold>) original test scene; (<bold>b</bold>) degraded scene image formed by applying the MSF method; (<bold>c</bold>) image reconstructed by applying the CLS algorithm; (<bold>d</bold>) image reconstructed by applying the WCLS algorithm.</p></caption>
<graphic xlink:href="sensors-12-02539f7.gif"/></fig>
<fig id="f8-sensors-12-02539" position="float">
<label>Figure 8.</label>
<caption>
<p>Implementation results for the second observation scenario (SNR μ = 10 dB): (<bold>a</bold>) original test scene; (<bold>b</bold>) degraded scene image formed by applying the MSF method; (<bold>c</bold>) image reconstructed by applying the CLS algorithm; (<bold>d</bold>) image reconstructed by applying the WCLS algorithm.</p></caption>
<graphic xlink:href="sensors-12-02539f8.gif"/></fig>
<table-wrap id="t1-sensors-12-02539" position="float">
<label>Table 1.</label>
<caption>
<p>Synthesis results of the proposed dual SSA core. SSA order: <italic>m</italic> = 64.</p></caption>
<table frame="box" rules="cols">
<thead>
<tr>
<th align="center" valign="bottom"><bold><italic>Device</italic>→</bold></th>
<th align="center" valign="bottom"><bold><italic>Virtex-4 XC4VSX35</italic></bold></th>
<th align="center" valign="bottom"><bold><italic>Virtex-5 XC5VFX130T</italic></bold></th></tr>
<tr>
<th colspan="3" align="center" valign="bottom">
<hr/></th></tr></thead>
<tbody>
<tr>
<td align="center" valign="middle">LUTs</td>
<td align="center" valign="middle">12,416</td>
<td align="center" valign="middle">12,416</td></tr>
<tr>
<td align="center" valign="middle">Slices</td>
<td align="center" valign="middle">10,624</td>
<td align="center" valign="middle">20,480</td></tr>
<tr>
<td align="center" valign="middle">Flip-Flops</td>
<td align="center" valign="middle">20,480</td>
<td align="center" valign="middle">12,288</td></tr>
<tr>
<td align="center" valign="middle">Output bit-width</td>
<td align="center" valign="middle">32</td>
<td align="center" valign="middle">32</td></tr>
<tr>
<td align="center" valign="middle">Max. Clock freq. (MHz)</td>
<td align="center" valign="middle">910.47</td>
<td align="center" valign="middle">920.93</td></tr></tbody></table></table-wrap>
<table-wrap id="t2-sensors-12-02539" position="float">
<label>Table 2.</label>
<caption>
<p>IOSNR of the WCLS algorithm evaluated for different SNRs.</p></caption>
<table frame="box" rules="cols">
<thead>
<tr>
<th align="center" valign="middle" rowspan="3"><bold>SNR [dB]</bold></th>
<th colspan="2" align="center" valign="middle"><bold>FIRST SCENARIO <italic>Ψ<sub>a</sub></italic>(<italic>x</italic>) = 13</bold></th>
<th colspan="2" align="center" valign="middle"><bold>SECOND SCENARIO <italic>Ψ<sub>a</sub></italic>(<italic>x</italic>) = 25</bold></th></tr>
<tr>
<th colspan="4" align="center" valign="middle">
<hr/></th></tr>
<tr>
<th align="center" valign="middle"><bold>IOSNR<sup>(CLS)</sup> [dB]</bold></th>
<th align="center" valign="middle"><bold>IOSNR<sup>(WCLS)</sup> [dB]</bold></th>
<th align="center" valign="middle"><bold>IOSNR<sup>(CLS)</sup> [dB]</bold></th>
<th align="center" valign="middle"><bold>IOSNR<sup>(WCLS)</sup> [dB]</bold></th></tr>
<tr>
<th colspan="5" align="center" valign="middle">
<hr/></th></tr></thead>
<tbody>
<tr>
<td align="center" valign="top">5</td>
<td align="center" valign="top">2.12</td>
<td align="center" valign="top">3.26</td>
<td align="center" valign="top">2.67</td>
<td align="center" valign="top">3.92</td></tr>
<tr>
<td align="center" valign="top">10</td>
<td align="center" valign="top">3.43</td>
<td align="center" valign="top">4.45</td>
<td align="center" valign="top">4.59</td>
<td align="center" valign="top">5.83</td></tr>
<tr>
<td align="center" valign="top">15</td>
<td align="center" valign="top">4.17</td>
<td align="center" valign="top">5.23</td>
<td align="center" valign="top">5.51</td>
<td align="center" valign="top">7.64</td></tr>
<tr>
<td align="center" valign="top">20</td>
<td align="center" valign="top">5.36</td>
<td align="center" valign="top">6.82</td>
<td align="center" valign="top">6.47</td>
<td align="center" valign="top">9.87</td></tr>
<tr>
<td align="center" valign="top">25</td>
<td align="center" valign="top">6.94</td>
<td align="center" valign="top">8.27</td>
<td align="center" valign="top">8.32</td>
<td align="center" valign="top">11.16</td></tr></tbody></table></table-wrap>
<table-wrap id="t3-sensors-12-02539" position="float">
<label>Table 3.</label>
<caption>
<p>Processing times required for implementing the WCLS algorithm.</p></caption>
<table frame="box" rules="all">
<thead>
<tr>
<th align="center" valign="middle" rowspan="2"><bold>Implementation →</bold></th>
<th align="center" valign="middle"><bold>Processing time (s)</bold></th></tr>
<tr>
<th align="center" valign="middle"><bold>WCLS</bold></th></tr></thead>
<tbody>
<tr>
<td align="center" valign="top">PC-Oriented Implementation of the WCLS</td>
<td align="center" valign="top">12.6</td></tr>
<tr>
<td align="center" valign="top">Implemented with the proposed dual SSA core architecture</td>
<td align="center" valign="top">0.81</td></tr></tbody></table></table-wrap></sec></back></article>
