<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE article PUBLIC "-//NLM//DTD Journal Publishing DTD v2.3 20070202//EN" "journalpublishing.dtd">
<article xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink" xml:lang="en" article-type="research-article">
<front>
<journal-meta>
<journal-id journal-id-type="nlm-ta">Sensors</journal-id>
<journal-title>Sensors</journal-title>
<issn pub-type="epub">1424-8220</issn>
<publisher>
<publisher-name>Molecular Diversity Preservation International (MDPI)</publisher-name></publisher></journal-meta>
<article-meta>
<article-id pub-id-type="doi">10.3390/s120506244</article-id>
<article-id pub-id-type="publisher-id">sensors-12-06244</article-id>
<article-categories>
<subj-group>
<subject>Article</subject></subj-group></article-categories>
<title-group>
<article-title>FPGA Implementation of Generalized Hebbian Algorithm for Texture Classification</article-title></title-group>
<contrib-group>
<contrib contrib-type="author">
<name><surname>Lin</surname><given-names>Shiow-Jyu</given-names></name><xref ref-type="aff" rid="af1-sensors-12-06244"><sup>1</sup></xref></contrib>
<contrib contrib-type="author">
<name><surname>Hwang</surname><given-names>Wen-Jyi</given-names></name><xref ref-type="aff" rid="af2-sensors-12-06244"><sup>2</sup></xref><xref ref-type="corresp" rid="c1-sensors-12-06244"><sup>*</sup></xref></contrib>
<contrib contrib-type="author">
<name><surname>Lee</surname><given-names>Wei-Hao</given-names></name><xref ref-type="aff" rid="af2-sensors-12-06244"><sup>2</sup></xref></contrib></contrib-group>
<aff id="af1-sensors-12-06244">
<label>1</label> Department of Electronic Engineering, National Ilan University, Yilan 260, Taiwan; E-Mail: <email>sjlin@niu.edu.tw</email></aff>
<aff id="af2-sensors-12-06244">
<label>2</label> Department of Computer Science and Information Engineering, National Taiwan Normal University, Taipei 116, Taiwan; E-Mail: <email>699470125@ntnu.edu.tw</email></aff>
<author-notes>
<corresp id="c1-sensors-12-06244">
<label>*</label>Author to whom correspondence should be addressed; E-Mail: <email>whwang@csie.ntnu.edu.tw</email>; Tel.: +886-2-7734-6670; Fax.: +886-2-2932-2378.</corresp></author-notes>
<pub-date pub-type="collection">
<year>2012</year></pub-date>
<pub-date pub-type="epub">
<day>10</day>
<month>05</month>
<year>2012</year></pub-date>
<volume>12</volume>
<issue>5</issue>
<fpage>6244</fpage>
<lpage>6268</lpage>
<history>
<date date-type="received">
<day>26</day>
<month>03</month>
<year>2012</year></date>
<date date-type="rev-recd">
<day>02</day>
<month>05</month>
<year>2012</year></date>
<date date-type="accepted">
<day>02</day>
<month>05</month>
<year>2012</year></date></history>
<permissions>
<copyright-statement>© 2012 by the authors; licensee MDPI, Basel, Switzerland.</copyright-statement>
<copyright-year>2012</copyright-year>
<license>
<p>This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution license (http://creativecommons.org/licenses/by/3.0/).</p></license></permissions>
<abstract>
<p>This paper presents a novel hardware architecture for principal component analysis. The architecture is based on the Generalized Hebbian Algorithm (GHA) because of its simplicity and effectiveness. The architecture is separated into three portions: the weight vector updating unit, the principal component computing unit and the memory unit. In the weight vector updating unit, the computation of different synaptic weight vectors shares the same circuit to reduce area costs. To show the effectiveness of the circuit, a texture classification system based on the proposed architecture is physically implemented on a Field Programmable Gate Array (FPGA). It is embedded in a System-On-Programmable-Chip (SOPC) platform for performance measurement. Experimental results show that the proposed architecture is an efficient design for attaining both high speed performance and low area costs.</p></abstract>
<kwd-group>
<kwd>system on programmable chip</kwd>
<kwd>reconfigurable computing</kwd>
<kwd>principal component analysis</kwd>
<kwd>generalized Hebbian algorithm</kwd>
<kwd>texture classification</kwd>
<kwd>FPGA</kwd></kwd-group></article-meta></front>
<body>
<sec sec-type="intro">
<label>1.</label>
<title>Introduction</title>
<p>Principal Component Analysis (PCA) [<xref ref-type="bibr" rid="b1-sensors-12-06244">1</xref>] plays an important role in pattern recognition, classification, computer vision and data compression [<xref ref-type="bibr" rid="b2-sensors-12-06244">2</xref>,<xref ref-type="bibr" rid="b3-sensors-12-06244">3</xref>]. It is an effective feature extraction technique capable of finding a compact and accurate representation of the data that reduces or eliminates statistically redundant components. The basic PCA implementation involves the Eigen-Value Decomposition (EVD) of the covariance matrix, which usually requires long computation time and large storage. The basic PCA is therefore not suited for online computation on platforms with limited computation capacity and storage size.</p>
<p>To compute the PCA with reduced computational complexity, a number of fast algorithms [<xref ref-type="bibr" rid="b2-sensors-12-06244">2</xref>,<xref ref-type="bibr" rid="b4-sensors-12-06244">4</xref>–<xref ref-type="bibr" rid="b6-sensors-12-06244">6</xref>] have been proposed. The algorithm presented in [<xref ref-type="bibr" rid="b4-sensors-12-06244">4</xref>] is based on Expectation Maximization (EM). It requires a matrix inversion, which may be computationally expensive. Incremental and/or iterative algorithms for PCA computations are proposed in [<xref ref-type="bibr" rid="b2-sensors-12-06244">2</xref>,<xref ref-type="bibr" rid="b5-sensors-12-06244">5</xref>,<xref ref-type="bibr" rid="b6-sensors-12-06244">6</xref>]. A common drawback of these fast PCA methods is that they still involve the covariance matrix of the training data, so the computation time and storage may still be expensive. Although hardware implementation of PCA is possible, large storage size and complicated circuit control management are usually necessary. The PCA hardware implementation therefore may be used only for data with small dimensions [<xref ref-type="bibr" rid="b7-sensors-12-06244">7</xref>–<xref ref-type="bibr" rid="b9-sensors-12-06244">9</xref>] when limited hardware resources are available. Because of the difficulties of hardware implementation, many PCA-based applications use software for the PCA computation. After the eigenvectors are obtained, only the projection computation is implemented in hardware [<xref ref-type="bibr" rid="b10-sensors-12-06244">10</xref>–<xref ref-type="bibr" rid="b12-sensors-12-06244">12</xref>].</p>
<p>An alternative for the PCA implementation is to use the Generalized Hebbian Algorithm (GHA) [<xref ref-type="bibr" rid="b13-sensors-12-06244">13</xref>,<xref ref-type="bibr" rid="b14-sensors-12-06244">14</xref>]. The GHA is based on an effective incremental updating scheme that does not involve the covariance matrix. The storage requirement for the PCA implementation is then significantly reduced. Nevertheless, the GHA usually converges slowly. A large number of iterations is therefore required, resulting in long computation time. An effective approach to expedite the GHA training is based on multithreading techniques, which take advantage of all the cores of multicore processors to reduce the computation time. However, multicore processors usually consume considerable power [<xref ref-type="bibr" rid="b15-sensors-12-06244">15</xref>], and therefore may not be suited for applications requiring low power dissipation.</p>
<p>Analog hardware implementations of GHA [<xref ref-type="bibr" rid="b16-sensors-12-06244">16</xref>,<xref ref-type="bibr" rid="b17-sensors-12-06244">17</xref>] have been found to be a power-efficient approach for accelerating the computation. However, these architectures are difficult to use directly in digital devices. A number of digital hardware architectures [<xref ref-type="bibr" rid="b18-sensors-12-06244">18</xref>,<xref ref-type="bibr" rid="b19-sensors-12-06244">19</xref>] have been proposed for expediting the GHA training process. The architecture in [<xref ref-type="bibr" rid="b18-sensors-12-06244">18</xref>] separates the weight vector updating process of GHA into a number of stages for data reuse. Although the architecture has fast computation time, its hardware resource utilization grows linearly with the dimension of the data and the number of principal components. Therefore, the architecture may not be well suited for data with high vector dimension and/or a large number of principal components.</p>
<p>A systolic array with low area costs is proposed in [<xref ref-type="bibr" rid="b19-sensors-12-06244">19</xref>]. The systolic array is based on pixel-wise operations so that the area costs for weight vector updating are independent of vector dimension. Nevertheless, the latency of the architecture increases with the dimension of data. Moreover, similar to the architecture in [<xref ref-type="bibr" rid="b18-sensors-12-06244">18</xref>], the area costs of [<xref ref-type="bibr" rid="b19-sensors-12-06244">19</xref>] grow with the number of principal components. Therefore, the architecture may still have long latency and high area costs.</p>
<p>In light of the facts stated above, a novel GHA implementation capable of performing fast PCA with low power consumption is presented. The implementation is based on a Field Programmable Gate Array (FPGA) because it consumes less power than its multicore counterparts [<xref ref-type="bibr" rid="b20-sensors-12-06244">20</xref>,<xref ref-type="bibr" rid="b21-sensors-12-06244">21</xref>]. As compared with existing FPGA-based architectures for GHA, the proposed architecture has lower area cost and/or lower latency. The proposed architecture can be divided into three parts: the Synaptic Weight Updating (SWU) unit, the Principal Components Computing (PCC) unit, and the memory unit. The memory unit is the on-chip memory storing training vectors and synaptic weight vectors. Based on the data stored in the memory unit, the PCC and SWU units are then used to compute the principal components and update the synaptic weight vectors, respectively.</p>
<p>In the SWU and PCC units, the input training vectors and synaptic weight vectors are separated into a number of non-overlapping blocks for principal component computation and synaptic weight vector updating. Both the SWU and PCC units operate on one block at a time. In each unit, the operations on different blocks share the same circuit to reduce the area costs. Moreover, in the SWU unit, the results for preceding weight vectors are reused in the computation of subsequent weight vectors to reduce training time.</p>
<p>To demonstrate the effectiveness of the proposed architecture, a texture classification system on a System-On-Programmable-Chip (SOPC) platform is constructed. The system consists of the proposed architecture, a softcore NIOS II processor [<xref ref-type="bibr" rid="b22-sensors-12-06244">22</xref>], a DMA controller, and an SDRAM. The proposed architecture is adopted for finding the PCA transform by the GHA training, where the training vectors are stored in the SDRAM. The DMA controller is used for the DMA delivery of the training vectors. The softcore processor is only used for coordinating the SOPC system. It does not participate in the GHA training process. As compared with its multithreaded software counterpart running on Intel multicore processors, our system has lower computation time and lower power consumption for large training sets. All these facts demonstrate the effectiveness of the proposed architecture.</p>
<sec>
<label>2.</label>
<title>Preliminaries</title>
<p><xref ref-type="fig" rid="f1-sensors-12-06244">Figure 1</xref> shows the neural model for GHA, where x(<italic>n</italic>) = [<italic>x</italic><sub>1</sub>(<italic>n</italic>),…,<italic>x<sub>m</sub></italic>(<italic>n</italic>)]<italic><sup>T</sup></italic>, and y(<italic>n</italic>) = [<italic>y</italic><sub>1</sub>(<italic>n</italic>), …,<italic>y<sub>p</sub></italic>(<italic>n</italic>)]<italic><sup>T</sup></italic> are the input and output vectors to the GHA model, respectively. In addition, <italic>m</italic> and <italic>p</italic> are the vector dimension and the number of Principal Components (PCs) for the GHA, respectively. The output vector y(<italic>n</italic>) is related to the input vector x(<italic>n</italic>) by
<disp-formula id="FD1">
<label>(1)</label>
<mml:math id="mm1" display="block">
<mml:semantics id="sm1">
<mml:mrow>
<mml:msub>
<mml:mi>y</mml:mi>
<mml:mi>j</mml:mi></mml:msub>
<mml:mo stretchy="false">(</mml:mo>
<mml:mi>n</mml:mi>
<mml:mo stretchy="false">)</mml:mo>
<mml:mo>=</mml:mo>
<mml:munderover>
<mml:mo>∑</mml:mo>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mo>=</mml:mo>
<mml:mn>1</mml:mn></mml:mrow>
<mml:mi>m</mml:mi></mml:munderover>
<mml:mrow>
<mml:msub>
<mml:mi>w</mml:mi>
<mml:mrow>
<mml:mi>j</mml:mi>
<mml:mi>i</mml:mi></mml:mrow></mml:msub>
<mml:mo stretchy="false">(</mml:mo>
<mml:mi>n</mml:mi>
<mml:mo stretchy="false">)</mml:mo>
<mml:msub>
<mml:mi>x</mml:mi>
<mml:mi>i</mml:mi></mml:msub>
<mml:mo stretchy="false">(</mml:mo>
<mml:mi>n</mml:mi>
<mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mrow></mml:semantics></mml:math></disp-formula>where the <italic>w<sub>ji</sub></italic>(<italic>n</italic>) stands for the weight from the <italic>i</italic>-th synapse to the <italic>j</italic>-th neuron at iteration <italic>n</italic>.</p>
<p>Let
<disp-formula id="FD2">
<label>(2)</label>
<mml:math id="mm2" display="block">
<mml:semantics id="sm2">
<mml:mrow>
<mml:msub>
<mml:mi>w</mml:mi>
<mml:mi>j</mml:mi></mml:msub>
<mml:mo stretchy="false">(</mml:mo>
<mml:mi>n</mml:mi>
<mml:mo stretchy="false">)</mml:mo>
<mml:mo>=</mml:mo>
<mml:msup>
<mml:mrow>
<mml:mo stretchy="false">[</mml:mo>
<mml:msub>
<mml:mi>w</mml:mi>
<mml:mrow>
<mml:mi>j</mml:mi>
<mml:mn>1</mml:mn></mml:mrow></mml:msub>
<mml:mo stretchy="false">(</mml:mo>
<mml:mi>n</mml:mi>
<mml:mo stretchy="false">)</mml:mo>
<mml:mo>,</mml:mo>
<mml:mo>…</mml:mo>
<mml:msub>
<mml:mi>w</mml:mi>
<mml:mrow>
<mml:mi>j</mml:mi>
<mml:mi>m</mml:mi></mml:mrow></mml:msub>
<mml:mo stretchy="false">(</mml:mo>
<mml:mi>n</mml:mi>
<mml:mo stretchy="false">)</mml:mo>
<mml:mo stretchy="false">]</mml:mo></mml:mrow>
<mml:mi>T</mml:mi></mml:msup>
<mml:mo>,</mml:mo>
<mml:mi>j</mml:mi>
<mml:mo>=</mml:mo>
<mml:mn>1</mml:mn>
<mml:mo>,</mml:mo>
<mml:mo>…</mml:mo>
<mml:mo>,</mml:mo>
<mml:mi>p</mml:mi></mml:mrow></mml:semantics></mml:math></disp-formula>be the <italic>j</italic>-th synaptic weight vector. Each synaptic weight vector w<italic><sub>j</sub></italic>(<italic>n</italic>) is adapted by the Hebbian learning rule:
<disp-formula id="FD3">
<label>(3)</label>
<mml:math id="mm3" display="block">
<mml:semantics id="sm3">
<mml:mrow>
<mml:msub>
<mml:mi>w</mml:mi>
<mml:mrow>
<mml:mi>j</mml:mi>
<mml:mi>i</mml:mi></mml:mrow></mml:msub>
<mml:mo stretchy="false">(</mml:mo>
<mml:mi>n</mml:mi>
<mml:mo>+</mml:mo>
<mml:mn>1</mml:mn>
<mml:mo stretchy="false">)</mml:mo>
<mml:mo>=</mml:mo>
<mml:msub>
<mml:mi>w</mml:mi>
<mml:mrow>
<mml:mi>j</mml:mi>
<mml:mi>i</mml:mi></mml:mrow></mml:msub>
<mml:mo stretchy="false">(</mml:mo>
<mml:mi>n</mml:mi>
<mml:mo stretchy="false">)</mml:mo>
<mml:mo>+</mml:mo>
<mml:mi>η</mml:mi>
<mml:mo stretchy="false">[</mml:mo>
<mml:msub>
<mml:mi>y</mml:mi>
<mml:mi>j</mml:mi></mml:msub>
<mml:mo stretchy="false">(</mml:mo>
<mml:mi>n</mml:mi>
<mml:mo stretchy="false">)</mml:mo>
<mml:msub>
<mml:mi>x</mml:mi>
<mml:mi>i</mml:mi></mml:msub>
<mml:mo stretchy="false">(</mml:mo>
<mml:mi>n</mml:mi>
<mml:mo stretchy="false">)</mml:mo>
<mml:mo>−</mml:mo>
<mml:msub>
<mml:mi>y</mml:mi>
<mml:mi>j</mml:mi></mml:msub>
<mml:mo stretchy="false">(</mml:mo>
<mml:mi>n</mml:mi>
<mml:mo stretchy="false">)</mml:mo>
<mml:munderover>
<mml:mo>∑</mml:mo>
<mml:mrow>
<mml:mi>k</mml:mi>
<mml:mo>=</mml:mo>
<mml:mn>1</mml:mn></mml:mrow>
<mml:mi>j</mml:mi></mml:munderover>
<mml:mrow>
<mml:msub>
<mml:mi>w</mml:mi>
<mml:mrow>
<mml:mi>k</mml:mi>
<mml:mi>i</mml:mi></mml:mrow></mml:msub>
<mml:mo stretchy="false">(</mml:mo>
<mml:mi>n</mml:mi>
<mml:mo stretchy="false">)</mml:mo>
<mml:msub>
<mml:mi>y</mml:mi>
<mml:mi>k</mml:mi></mml:msub>
<mml:mo stretchy="false">(</mml:mo>
<mml:mi>n</mml:mi>
<mml:mo stretchy="false">)</mml:mo></mml:mrow>
<mml:mo stretchy="false">]</mml:mo></mml:mrow></mml:semantics></mml:math></disp-formula>where <italic>η</italic> denotes the learning rate. After a large number of iterations of computation and adaptation, w<italic><sub>j</sub></italic>(<italic>n</italic>) asymptotically approaches the eigenvector associated with the <italic>j</italic>-th eigenvalue λ<italic><sub>j</sub></italic> of the covariance matrix of the input vectors, where λ<sub>1</sub> &gt; λ<sub>2</sub> &gt; … &gt; λ<italic><sub>p</sub></italic>. To reduce the implementation complexity, <xref ref-type="disp-formula" rid="FD3">Equation (3)</xref> can be rewritten as
<disp-formula id="FD4">
<label>(4)</label>
<mml:math id="mm4" display="block">
<mml:semantics id="sm4">
<mml:mrow>
<mml:msub>
<mml:mi>w</mml:mi>
<mml:mrow>
<mml:mi>j</mml:mi>
<mml:mi>i</mml:mi></mml:mrow></mml:msub>
<mml:mo stretchy="false">(</mml:mo>
<mml:mi>n</mml:mi>
<mml:mo>+</mml:mo>
<mml:mn>1</mml:mn>
<mml:mo stretchy="false">)</mml:mo>
<mml:mo>=</mml:mo>
<mml:msub>
<mml:mi>w</mml:mi>
<mml:mrow>
<mml:mi>j</mml:mi>
<mml:mi>i</mml:mi></mml:mrow></mml:msub>
<mml:mo stretchy="false">(</mml:mo>
<mml:mi>n</mml:mi>
<mml:mo stretchy="false">)</mml:mo>
<mml:mo>+</mml:mo>
<mml:mi>η</mml:mi>
<mml:msub>
<mml:mi>y</mml:mi>
<mml:mi>j</mml:mi></mml:msub>
<mml:mo stretchy="false">(</mml:mo>
<mml:mi>n</mml:mi>
<mml:mo stretchy="false">)</mml:mo>
<mml:mo stretchy="false">[</mml:mo>
<mml:msub>
<mml:mi>x</mml:mi>
<mml:mi>i</mml:mi></mml:msub>
<mml:mo stretchy="false">(</mml:mo>
<mml:mi>n</mml:mi>
<mml:mo stretchy="false">)</mml:mo>
<mml:mo>−</mml:mo>
<mml:munderover>
<mml:mo>∑</mml:mo>
<mml:mrow>
<mml:mi>k</mml:mi>
<mml:mo>=</mml:mo>
<mml:mn>1</mml:mn></mml:mrow>
<mml:mi>j</mml:mi></mml:munderover>
<mml:mrow>
<mml:msub>
<mml:mi>w</mml:mi>
<mml:mrow>
<mml:mi>k</mml:mi>
<mml:mi>i</mml:mi></mml:mrow></mml:msub>
<mml:mo stretchy="false">(</mml:mo>
<mml:mi>n</mml:mi>
<mml:mo stretchy="false">)</mml:mo>
<mml:msub>
<mml:mi>y</mml:mi>
<mml:mi>k</mml:mi></mml:msub>
<mml:mo stretchy="false">(</mml:mo>
<mml:mi>n</mml:mi>
<mml:mo stretchy="false">)</mml:mo></mml:mrow>
<mml:mo stretchy="false">]</mml:mo></mml:mrow></mml:semantics></mml:math></disp-formula></p>
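<p>As an informal illustration of Equations (1) and (4), one GHA iteration can be sketched in software. The sketch below is not the hardware design described in this paper; the function name <monospace>gha_step</monospace>, the list-of-lists data layout and the choice of learning rate are assumptions made only for this example.</p>
<preformat><![CDATA[
```python
def gha_step(weights, x, eta):
    """One GHA iteration following Equations (1) and (4).

    weights: list of p synaptic weight vectors, each a list of m floats
    x:       input vector x(n), a list of m floats
    eta:     learning rate (its value is an assumption of the caller)
    Returns (y, new_weights).
    """
    p, m = len(weights), len(x)

    # Equation (1): y_j(n) = sum over i of w_ji(n) * x_i(n)
    y = [sum(weights[j][i] * x[i] for i in range(m)) for j in range(p)]

    new_weights = []
    for j in range(p):
        row = []
        for i in range(m):
            # Inner sum of Equation (4); k runs over 1..j in the paper's
            # 1-based indexing, which is 0..j here.
            partial = sum(weights[k][i] * y[k] for k in range(j + 1))
            row.append(weights[j][i] + eta * y[j] * (x[i] - partial))
        new_weights.append(row)
    return y, new_weights
```
]]></preformat>
<p>Calling <monospace>gha_step</monospace> repeatedly over the training vectors plays the role of the iterations indexed by <italic>n</italic>. Note that the inner sum is recomputed from scratch for every <italic>j</italic> in this direct form.</p>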
<p>A more detailed discussion of GHA can be found in [<xref ref-type="bibr" rid="b13-sensors-12-06244">13</xref>,<xref ref-type="bibr" rid="b14-sensors-12-06244">14</xref>].</p></sec></sec>
<sec>
<label>3.</label>
<title>The Proposed GHA Architecture</title>
<p>As shown in <xref ref-type="fig" rid="f2-sensors-12-06244">Figure 2</xref>, the proposed GHA architecture consists of three functional units: the memory unit, the Synaptic Weight Updating (SWU) unit, and the Principal Components Computing (PCC) unit. The memory unit is used for storing the <italic>current</italic> synaptic weight vectors and input vectors. Assume the <italic>current</italic> synaptic weight vectors w<italic><sub>j</sub></italic>(<italic>n</italic>), <italic>j</italic> = 1,…,<italic>p</italic>, are now stored in the memory unit. In addition, the input vector x(<italic>n</italic>) is available. Based on x(<italic>n</italic>) and w<italic><sub>j</sub></italic>(<italic>n</italic>), <italic>j</italic> = 1,…,<italic>p</italic>, the goal of the PCC unit is to compute the output vector y(<italic>n</italic>). Using x(<italic>n</italic>), y(<italic>n</italic>) and w<italic><sub>j</sub></italic>(<italic>n</italic>), <italic>j</italic> = 1,…,<italic>p</italic>, the SWU unit produces the new synaptic weight vectors w<italic><sub>j</sub></italic>(<italic>n</italic> + 1), <italic>j</italic> = 1,…,<italic>p</italic>. It can be observed from <xref ref-type="fig" rid="f2-sensors-12-06244">Figure 2</xref> that the new synaptic weight vectors are stored back to the memory unit for subsequent training.</p>
<sec>
<label>3.1.</label>
<title>SWU Unit</title>
<p>The design of the SWU unit is based on <xref ref-type="disp-formula" rid="FD4">Equation (4)</xref>. Although the direct implementation of <xref ref-type="disp-formula" rid="FD4">Equation (4)</xref> is possible, it would consume substantial hardware resources. To see this, note from <xref ref-type="disp-formula" rid="FD4">Equation (4)</xref> that the computations of <italic>w<sub>ji</sub></italic>(<italic>n</italic> + 1) and <italic>w<sub>ri</sub></italic>(<italic>n</italic> + 1) share the same term 
<inline-formula>
<mml:math id="mm5" display="inline">
<mml:semantics id="sm5">
<mml:mtable columnalign="left">
<mml:mtr>
<mml:mtd>
<mml:msubsup>
<mml:mo>∑</mml:mo>
<mml:mrow>
<mml:mi>k</mml:mi>
<mml:mo>=</mml:mo>
<mml:mn>1</mml:mn></mml:mrow>
<mml:mi>r</mml:mi></mml:msubsup>
<mml:mrow>
<mml:msub>
<mml:mi>w</mml:mi>
<mml:mrow>
<mml:mi>k</mml:mi>
<mml:mi>i</mml:mi></mml:mrow></mml:msub>
<mml:mo stretchy="false">(</mml:mo>
<mml:mi>n</mml:mi>
<mml:mo stretchy="false">)</mml:mo>
<mml:msub>
<mml:mi>y</mml:mi>
<mml:mi>k</mml:mi></mml:msub>
<mml:mo stretchy="false">(</mml:mo>
<mml:mi>n</mml:mi>
<mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mtd></mml:mtr></mml:mtable></mml:semantics></mml:math></inline-formula> when <italic>r</italic> ≤ <italic>j</italic>. Consequently, implementing <italic>w<sub>ji</sub></italic>(<italic>n</italic> + 1) and <italic>w<sub>ri</sub></italic>(<italic>n</italic> + 1) independently in hardware using <xref ref-type="disp-formula" rid="FD4">Equation (4)</xref> would duplicate this computation and result in a large hardware resource overhead.</p>
<p>To reduce the resource consumption, we first define <italic>z<sub>ji</sub></italic>(<italic>n</italic>) as
<disp-formula id="FD5">
<label>(5)</label>
<mml:math id="mm6" display="block">
<mml:semantics id="sm6">
<mml:mrow>
<mml:msub>
<mml:mi>z</mml:mi>
<mml:mrow>
<mml:mi>j</mml:mi>
<mml:mi>i</mml:mi></mml:mrow></mml:msub>
<mml:mo stretchy="false">(</mml:mo>
<mml:mi>n</mml:mi>
<mml:mo stretchy="false">)</mml:mo>
<mml:mo>=</mml:mo>
<mml:msub>
<mml:mi>x</mml:mi>
<mml:mi>i</mml:mi></mml:msub>
<mml:mo stretchy="false">(</mml:mo>
<mml:mi>n</mml:mi>
<mml:mo stretchy="false">)</mml:mo>
<mml:mo>−</mml:mo>
<mml:munderover>
<mml:mo>∑</mml:mo>
<mml:mrow>
<mml:mi>k</mml:mi>
<mml:mo>=</mml:mo>
<mml:mn>1</mml:mn></mml:mrow>
<mml:mi>j</mml:mi></mml:munderover>
<mml:mrow>
<mml:msub>
<mml:mi>w</mml:mi>
<mml:mrow>
<mml:mi>k</mml:mi>
<mml:mi>i</mml:mi></mml:mrow></mml:msub>
<mml:mo stretchy="false">(</mml:mo>
<mml:mi>n</mml:mi>
<mml:mo stretchy="false">)</mml:mo>
<mml:msub>
<mml:mi>y</mml:mi>
<mml:mi>k</mml:mi></mml:msub>
<mml:mo stretchy="false">(</mml:mo>
<mml:mi>n</mml:mi>
<mml:mo stretchy="false">)</mml:mo>
<mml:mo>,</mml:mo>
<mml:mi>j</mml:mi>
<mml:mo>=</mml:mo>
<mml:mn>1</mml:mn>
<mml:mo>,</mml:mo>
<mml:mo>…</mml:mo>
<mml:mo>,</mml:mo>
<mml:mi>p</mml:mi></mml:mrow></mml:mrow></mml:semantics></mml:math></disp-formula>and <italic>z<sub>j</sub></italic>(<italic>n</italic>) = [<italic>z<sub>j</sub></italic><sub>1</sub>(<italic>n</italic>), …, <italic>z<sub>jm</sub></italic>(<italic>n</italic>)]<italic><sup>T</sup></italic>. Combining <xref ref-type="disp-formula" rid="FD4">Equations (4)</xref> and <xref ref-type="disp-formula" rid="FD5">(5)</xref>, we obtain
<disp-formula id="FD6">
<label>(6)</label>
<mml:math id="mm7" display="block">
<mml:semantics id="sm7">
<mml:mrow>
<mml:msub>
<mml:mi>w</mml:mi>
<mml:mrow>
<mml:mi>j</mml:mi>
<mml:mi>i</mml:mi></mml:mrow></mml:msub>
<mml:mo stretchy="false">(</mml:mo>
<mml:mi>n</mml:mi>
<mml:mo>+</mml:mo>
<mml:mn>1</mml:mn>
<mml:mo stretchy="false">)</mml:mo>
<mml:mo>=</mml:mo>
<mml:msub>
<mml:mi>w</mml:mi>
<mml:mrow>
<mml:mi>j</mml:mi>
<mml:mi>i</mml:mi></mml:mrow></mml:msub>
<mml:mo stretchy="false">(</mml:mo>
<mml:mi>n</mml:mi>
<mml:mo stretchy="false">)</mml:mo>
<mml:mo>+</mml:mo>
<mml:mi>η</mml:mi>
<mml:msub>
<mml:mi>y</mml:mi>
<mml:mi>j</mml:mi></mml:msub>
<mml:mo stretchy="false">(</mml:mo>
<mml:mi>n</mml:mi>
<mml:mo stretchy="false">)</mml:mo>
<mml:msub>
<mml:mi>z</mml:mi>
<mml:mrow>
<mml:mi>j</mml:mi>
<mml:mi>i</mml:mi></mml:mrow></mml:msub>
<mml:mo stretchy="false">(</mml:mo>
<mml:mi>n</mml:mi>
<mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:semantics></mml:math></disp-formula>where <italic>z<sub>ji</sub></italic>(<italic>n</italic>) can be obtained from <italic>z</italic><sub>(</sub><italic><sub>j−</sub></italic><sub>1)</sub><italic><sub>i</sub></italic>(<italic>n</italic>) by
<disp-formula id="FD7">
<label>(7)</label>
<mml:math id="mm8" display="block">
<mml:semantics id="sm8">
<mml:mrow>
<mml:msub>
<mml:mi>z</mml:mi>
<mml:mrow>
<mml:mi>j</mml:mi>
<mml:mi>i</mml:mi></mml:mrow></mml:msub>
<mml:mo stretchy="false">(</mml:mo>
<mml:mi>n</mml:mi>
<mml:mo stretchy="false">)</mml:mo>
<mml:mo>=</mml:mo>
<mml:msub>
<mml:mi>z</mml:mi>
<mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:mi>j</mml:mi>
<mml:mo>−</mml:mo>
<mml:mn>1</mml:mn>
<mml:mo stretchy="false">)</mml:mo>
<mml:mi>i</mml:mi></mml:mrow></mml:msub>
<mml:mo stretchy="false">(</mml:mo>
<mml:mi>n</mml:mi>
<mml:mo stretchy="false">)</mml:mo>
<mml:mo>−</mml:mo>
<mml:msub>
<mml:mi>w</mml:mi>
<mml:mrow>
<mml:mi>j</mml:mi>
<mml:mi>i</mml:mi></mml:mrow></mml:msub>
<mml:mo stretchy="false">(</mml:mo>
<mml:mi>n</mml:mi>
<mml:mo stretchy="false">)</mml:mo>
<mml:msub>
<mml:mi>y</mml:mi>
<mml:mi>j</mml:mi></mml:msub>
<mml:mo stretchy="false">(</mml:mo>
<mml:mi>n</mml:mi>
<mml:mo stretchy="false">)</mml:mo>
<mml:mo>,</mml:mo>
<mml:mi>j</mml:mi>
<mml:mo>=</mml:mo>
<mml:mn>2</mml:mn>
<mml:mo>,</mml:mo>
<mml:mo>…</mml:mo>
<mml:mo>,</mml:mo>
<mml:mi>p</mml:mi></mml:mrow></mml:semantics></mml:math></disp-formula></p>
<p>When <italic>j</italic> = 1, from <xref ref-type="disp-formula" rid="FD5">Equations (5)</xref> and <xref ref-type="disp-formula" rid="FD7">(7)</xref>, it follows that
<disp-formula id="FD8">
<label>(8)</label>
<mml:math id="mm9" display="block">
<mml:semantics id="sm9">
<mml:mrow>
<mml:msub>
<mml:mi>z</mml:mi>
<mml:mrow>
<mml:mn>0</mml:mn>
<mml:mi>i</mml:mi></mml:mrow></mml:msub>
<mml:mo stretchy="false">(</mml:mo>
<mml:mi>n</mml:mi>
<mml:mo stretchy="false">)</mml:mo>
<mml:mo>=</mml:mo>
<mml:msub>
<mml:mi>x</mml:mi>
<mml:mi>i</mml:mi></mml:msub>
<mml:mo stretchy="false">(</mml:mo>
<mml:mi>n</mml:mi>
<mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:semantics></mml:math></disp-formula></p>
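<p>The equivalence between the direct update of Equation (4) and the recursive form of Equations (6)–(8) can be checked numerically. The following software sketch is only an illustration under assumed names (<monospace>swu_direct</monospace>, <monospace>swu_recursive</monospace>) and an assumed list-based data layout; it does not model the circuit itself.</p>
<preformat><![CDATA[
```python
def swu_direct(weights, x, y, eta):
    """Weight update written directly from Equation (4)."""
    new_weights = []
    for j, w_j in enumerate(weights):
        row = []
        for i, w_ji in enumerate(w_j):
            # The partial sum over k is recomputed for every j.
            partial = sum(weights[k][i] * y[k] for k in range(j + 1))
            row.append(w_ji + eta * y[j] * (x[i] - partial))
        new_weights.append(row)
    return new_weights


def swu_recursive(weights, x, y, eta):
    """Weight update using the recursion of Equations (6)-(8)."""
    z = list(x)  # Equation (8): z_0(n) = x(n)
    new_weights = []
    for w_j, y_j in zip(weights, y):
        # Equation (7): z_j(n) = z_(j-1)(n) - w_j(n) * y_j(n)
        z = [z_i - w_ji * y_j for z_i, w_ji in zip(z, w_j)]
        # Equation (6): w_j(n+1) = w_j(n) + eta * y_j(n) * z_j(n)
        new_weights.append(
            [w_ji + eta * y_j * z_i for w_ji, z_i in zip(w_j, z)]
        )
    return new_weights
```
]]></preformat>
<p>In the recursive form each new weight vector needs only one additional multiply-subtract per element to obtain z<italic><sub>j</sub></italic>(<italic>n</italic>) from z<italic><sub>j</sub></italic><sub>−1</sub>(<italic>n</italic>), whereas the direct form repeats the whole partial sum for every <italic>j</italic>.</p>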
<p><xref ref-type="fig" rid="f3-sensors-12-06244">Figure 3</xref> depicts the hardware implementation of <xref ref-type="disp-formula" rid="FD6">Equations (6)</xref> and <xref ref-type="disp-formula" rid="FD7">(7)</xref>. As shown in the figure, the SWU unit produces one synaptic weight vector at a time. The computation of w<italic><sub>j</sub></italic>(<italic>n</italic> + 1), the <italic>j</italic>-th weight vector at iteration <italic>n</italic> + 1, requires z<italic><sub>j</sub></italic><sub>−1</sub>(<italic>n</italic>), y(<italic>n</italic>) and w<italic><sub>j</sub></italic>(<italic>n</italic>) as inputs. In addition to w<italic><sub>j</sub></italic>(<italic>n</italic> + 1), the SWU unit also produces z<italic><sub>j</sub></italic>(<italic>n</italic>), which is then used for the computation of w<italic><sub>j</sub></italic><sub>+1</sub>(<italic>n</italic> + 1). Because each partial sum is computed only once and reused, hardware resource consumption is effectively reduced.</p>
<p>One way to implement the SWU unit is to produce w<italic><sub>j</sub></italic>(<italic>n</italic> + 1) and z<italic><sub>j</sub></italic>(<italic>n</italic>) all at once. However, <italic>m</italic> identical modules, each as shown in <xref ref-type="fig" rid="f4-sensors-12-06244">Figure 4</xref>, would then be required because the dimension of the vectors is <italic>m</italic>. The area costs of the SWU unit would then grow linearly with <italic>m</italic>. To further reduce the area costs, each of the output vectors w<italic><sub>j</sub></italic>(<italic>n</italic> + 1) and z<italic><sub>j</sub></italic>(<italic>n</italic>) is separated into <italic>b</italic> blocks, where each block contains <italic>q</italic> elements. The SWU unit computes only one block of w<italic><sub>j</sub></italic>(<italic>n</italic> + 1) and z<italic><sub>j</sub></italic>(<italic>n</italic>) at a time. Therefore, it takes <italic>b</italic> clock cycles to produce the complete w<italic><sub>j</sub></italic>(<italic>n</italic> + 1) and z<italic><sub>j</sub></italic>(<italic>n</italic>).</p>
<p>Let
<disp-formula id="FD9">
<label>(9)</label>
<mml:math id="mm10" display="block">
<mml:semantics id="sm10">
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mover accent="true">
<mml:mtext>w</mml:mtext>
<mml:mo>^</mml:mo></mml:mover></mml:mrow>
<mml:mrow>
<mml:mi>j</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi>k</mml:mi></mml:mrow></mml:msub>
<mml:mo stretchy="false">(</mml:mo>
<mml:mi>n</mml:mi>
<mml:mo stretchy="false">)</mml:mo>
<mml:mo>=</mml:mo>
<mml:msup>
<mml:mrow>
<mml:mo stretchy="false">[</mml:mo>
<mml:msub>
<mml:mi>w</mml:mi>
<mml:mrow>
<mml:mi>j</mml:mi>
<mml:mo>,</mml:mo>
<mml:mo stretchy="false">(</mml:mo>
<mml:mi>k</mml:mi>
<mml:mo>−</mml:mo>
<mml:mn>1</mml:mn>
<mml:mo stretchy="false">)</mml:mo>
<mml:mi>q</mml:mi>
<mml:mo>+</mml:mo>
<mml:mn>1</mml:mn></mml:mrow></mml:msub>
<mml:mo stretchy="false">(</mml:mo>
<mml:mi>n</mml:mi>
<mml:mo stretchy="false">)</mml:mo>
<mml:mo>,</mml:mo>
<mml:mo>…</mml:mo>
<mml:mo>,</mml:mo>
<mml:msub>
<mml:mi>w</mml:mi>
<mml:mrow>
<mml:mi>j</mml:mi>
<mml:mo>,</mml:mo>
<mml:mo stretchy="false">(</mml:mo>
<mml:mi>k</mml:mi>
<mml:mo>−</mml:mo>
<mml:mn>1</mml:mn>
<mml:mo stretchy="false">)</mml:mo>
<mml:mi>q</mml:mi>
<mml:mo>+</mml:mo>
<mml:mi>q</mml:mi></mml:mrow></mml:msub>
<mml:mo stretchy="false">(</mml:mo>
<mml:mi>n</mml:mi>
<mml:mo stretchy="false">)</mml:mo>
<mml:mo stretchy="false">]</mml:mo></mml:mrow>
<mml:mi>T</mml:mi></mml:msup>
<mml:mo>,</mml:mo>
<mml:mi>k</mml:mi>
<mml:mo>=</mml:mo>
<mml:mn>1</mml:mn>
<mml:mo>,</mml:mo>
<mml:mo>…</mml:mo>
<mml:mo>,</mml:mo>
<mml:mi>b</mml:mi></mml:mrow></mml:semantics></mml:math></disp-formula>and
<disp-formula id="FD10">
<label>(10)</label>
<mml:math id="mm11" display="block">
<mml:semantics id="sm11">
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mover accent="true">
<mml:mtext>z</mml:mtext>
<mml:mo>^</mml:mo></mml:mover></mml:mrow>
<mml:mrow>
<mml:mi>j</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi>k</mml:mi></mml:mrow></mml:msub>
<mml:mo stretchy="false">(</mml:mo>
<mml:mi>n</mml:mi>
<mml:mo stretchy="false">)</mml:mo>
<mml:mo>=</mml:mo>
<mml:msup>
<mml:mrow>
<mml:mo stretchy="false">[</mml:mo>
<mml:msub>
<mml:mi>z</mml:mi>
<mml:mrow>
<mml:mi>j</mml:mi>
<mml:mo>,</mml:mo>
<mml:mo stretchy="false">(</mml:mo>
<mml:mi>k</mml:mi>
<mml:mo>−</mml:mo>
<mml:mn>1</mml:mn>
<mml:mo stretchy="false">)</mml:mo>
<mml:mi>q</mml:mi>
<mml:mo>+</mml:mo>
<mml:mn>1</mml:mn></mml:mrow></mml:msub>
<mml:mo stretchy="false">(</mml:mo>
<mml:mi>n</mml:mi>
<mml:mo stretchy="false">)</mml:mo>
<mml:mo>,</mml:mo>
<mml:mo>…</mml:mo>
<mml:mo>,</mml:mo>
<mml:msub>
<mml:mi>z</mml:mi>
<mml:mrow>
<mml:mi>j</mml:mi>
<mml:mo>,</mml:mo>
<mml:mo stretchy="false">(</mml:mo>
<mml:mi>k</mml:mi>
<mml:mo>−</mml:mo>
<mml:mn>1</mml:mn>
<mml:mo stretchy="false">)</mml:mo>
<mml:mi>q</mml:mi>
<mml:mo>+</mml:mo>
<mml:mi>q</mml:mi></mml:mrow></mml:msub>
<mml:mo stretchy="false">(</mml:mo>
<mml:mi>n</mml:mi>
<mml:mo stretchy="false">)</mml:mo>
<mml:mo stretchy="false">]</mml:mo></mml:mrow>
<mml:mi>T</mml:mi></mml:msup>
<mml:mo>,</mml:mo>
<mml:mi>k</mml:mi>
<mml:mo>=</mml:mo>
<mml:mn>1</mml:mn>
<mml:mo>,</mml:mo>
<mml:mo>…</mml:mo>
<mml:mo>,</mml:mo>
<mml:mi>b</mml:mi></mml:mrow></mml:semantics></mml:math></disp-formula>be the <italic>k</italic>-th block of w<italic><sub>j</sub></italic>(<italic>n</italic>) and z<italic><sub>j</sub></italic>(<italic>n</italic>), respectively. The computation of w<italic><sub>j</sub></italic>(<italic>n</italic> + 1) and z<italic><sub>j</sub></italic>(<italic>n</italic>) takes <italic>b</italic> clock cycles. At the <italic>k</italic>-th clock cycle, <italic>k</italic> = 1,…,<italic>b</italic>, the SWU unit computes ŵ<italic><sub>j,k</sub></italic>(<italic>n</italic> + 1) and ẑ<italic><sub>j,k</sub></italic>(<italic>n</italic>). Because each of ŵ<italic><sub>j,k</sub></italic>(<italic>n</italic> + 1) and ẑ<italic><sub>j,k</sub></italic>(<italic>n</italic>) contains only <italic>q</italic> elements, the SWU unit consists of <italic>q</italic> identical modules. The architecture of each module is also shown in <xref ref-type="fig" rid="f4-sensors-12-06244">Figure 4</xref>. The SWU unit can be used for GHA with different vector dimensions <italic>m</italic>. As <italic>m</italic> increases, the area costs remain the same, at the expense of a larger number of clock cycles <italic>b</italic> for computing all blocks ŵ<italic><sub>j,k</sub></italic>(<italic>n</italic> + 1) and ẑ<italic><sub>j,k</sub></italic>(<italic>n</italic>).</p>
<p>Based on <xref ref-type="disp-formula" rid="FD8">Equation (8)</xref>, the input vector z<sub>0</sub>(<italic>n</italic>) is actually the training vector x(<italic>n</italic>), which is also separated into <italic>b</italic> blocks, where the <italic>k</italic>-th block is given by
<disp-formula id="FD11">
<label>(11)</label>
<mml:math id="mm12" display="block">
<mml:semantics id="sm12">
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mover accent="true">
<mml:mtext>z</mml:mtext>
<mml:mo>^</mml:mo></mml:mover></mml:mrow>
<mml:mrow>
<mml:mn>0</mml:mn>
<mml:mo>,</mml:mo>
<mml:mi>k</mml:mi></mml:mrow></mml:msub>
<mml:mo stretchy="false">(</mml:mo>
<mml:mi>n</mml:mi>
<mml:mo stretchy="false">)</mml:mo>
<mml:mo>=</mml:mo>
<mml:msup>
<mml:mrow>
<mml:mo stretchy="false">[</mml:mo>
<mml:msub>
<mml:mi>x</mml:mi>
<mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:mi>k</mml:mi>
<mml:mo>−</mml:mo>
<mml:mn>1</mml:mn>
<mml:mo stretchy="false">)</mml:mo>
<mml:mi>q</mml:mi>
<mml:mo>+</mml:mo>
<mml:mn>1</mml:mn></mml:mrow></mml:msub>
<mml:mo stretchy="false">(</mml:mo>
<mml:mi>n</mml:mi>
<mml:mo stretchy="false">)</mml:mo>
<mml:mo>,</mml:mo>
<mml:mo>…</mml:mo>
<mml:mo>,</mml:mo>
<mml:msub>
<mml:mi>x</mml:mi>
<mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:mi>k</mml:mi>
<mml:mo>−</mml:mo>
<mml:mn>1</mml:mn>
<mml:mo stretchy="false">)</mml:mo>
<mml:mi>q</mml:mi>
<mml:mo>+</mml:mo>
<mml:mi>q</mml:mi></mml:mrow></mml:msub>
<mml:mo stretchy="false">(</mml:mo>
<mml:mi>n</mml:mi>
<mml:mo stretchy="false">)</mml:mo>
<mml:mo stretchy="false">]</mml:mo></mml:mrow>
<mml:mi>T</mml:mi></mml:msup>
<mml:mo>,</mml:mo>
<mml:mi>k</mml:mi>
<mml:mo>=</mml:mo>
<mml:mn>1</mml:mn>
<mml:mo>,</mml:mo>
<mml:mo>…</mml:mo>
<mml:mo>,</mml:mo>
<mml:mi>b</mml:mi></mml:mrow></mml:semantics></mml:math></disp-formula></p>
<p>The ẑ<sub>0</sub><italic><sub>,k</sub></italic>(<italic>n</italic>) and ŵ<sub>1,<italic>k</italic></sub>(<italic>n</italic>), <italic>k</italic> = 1,…, <italic>b</italic>, are used as the input vectors for the computation of ẑ<sub>1,</sub><italic><sub>k</sub></italic>(<italic>n</italic>) and ŵ<sub>1,<italic>k</italic></sub>(<italic>n</italic> + 1), <italic>k</italic> = 1,…,<italic>b</italic>. The z<sub>1</sub>(<italic>n</italic>) and w<sub>1</sub>(<italic>n</italic> + 1) become available when all the ẑ<sub>1,</sub><italic><sub>k</sub></italic>(<italic>n</italic>) and ŵ<sub>1</sub><italic><sub>,k</sub></italic>(<italic>n</italic> + 1), <italic>k</italic> = 1,…,<italic>b</italic>, are obtained. <xref ref-type="fig" rid="f5-sensors-12-06244">Figure 5</xref> shows the computation of ẑ<sub>1,1</sub>(<italic>n</italic>) and ŵ<sub>1,1</sub>(<italic>n</italic> + 1) based on ẑ<sub>0,1</sub>(<italic>n</italic>) and ŵ<sub>1,1</sub>(<italic>n</italic>).</p>
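<p>The block-wise update described above can be sketched in software as follows. This is a minimal illustrative model rather than the hardware description of the SWU unit. It assumes the recursive GHA form in which z<italic><sub>j</sub></italic>(<italic>n</italic>) is obtained by subtracting <italic>y<sub>j</sub></italic>(<italic>n</italic>)w<italic><sub>j</sub></italic>(<italic>n</italic>) from z<sub><italic>j</italic>−1</sub>(<italic>n</italic>), and w<italic><sub>j</sub></italic>(<italic>n</italic> + 1) is obtained by adding the learning rate times <italic>y<sub>j</sub></italic>(<italic>n</italic>)z<italic><sub>j</sub></italic>(<italic>n</italic>) to w<italic><sub>j</sub></italic>(<italic>n</italic>). The names <monospace>swu_blockwise</monospace> and <monospace>eta</monospace> (the learning rate) are hypothetical.</p>
<preformat>
```python
def swu_blockwise(z_prev, w_j, y_j, eta, q):
    """Software model of one SWU pass over b = m / q blocks.

    Assumed update (recursive GHA form, not taken verbatim from the paper):
        z_j(n)     = z_{j-1}(n) - y_j(n) * w_j(n)
        w_j(n + 1) = w_j(n) + eta * y_j(n) * z_j(n)
    """
    m = len(z_prev)
    if len(w_j) != m or m % q != 0:
        raise ValueError("vectors must share a length that is a multiple of q")
    z_next, w_next = [], []
    for k in range(m // q):          # one block per clock cycle
        start, stop = k * q, (k + 1) * q
        # The q identical modules each handle one element of the block.
        z_blk = [z - y_j * w for z, w in zip(z_prev[start:stop], w_j[start:stop])]
        w_blk = [w + eta * y_j * z for w, z in zip(w_j[start:stop], z_blk)]
        z_next.extend(z_blk)
        w_next.extend(w_blk)
    return z_next, w_next
```
</preformat>
<p>In this sketch, calling the function with z<sub>0</sub>(<italic>n</italic>) and w<sub>1</sub>(<italic>n</italic>) returns z<sub>1</sub>(<italic>n</italic>) and w<sub>1</sub>(<italic>n</italic> + 1); the work per block is fixed by <italic>q</italic>, and only the number of loop iterations grows with <italic>m</italic>.</p>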
<p>After the computation of w<sub>1</sub>(<italic>n</italic> + 1) and z<sub>1</sub>(<italic>n</italic>) is completed, the vector z<sub>1</sub>(<italic>n</italic>) is used for the computation of z<sub>2</sub>(<italic>n</italic>) and w<sub>2</sub>(<italic>n</italic> + 1). The vector z<sub>2</sub>(<italic>n</italic>) is then used for the computation of w<sub>3</sub>(<italic>n</italic> + 1). The weight vector update for iteration <italic>n</italic> + 1 is not complete until the SWU unit produces the weight vector w<italic><sub>p</sub></italic>(<italic>n</italic> + 1).</p></sec>
<sec>
<label>3.2.</label>
<title>PCC Unit</title>
<p>The PCC operations are based on <xref ref-type="disp-formula" rid="FD1">Equation (1)</xref>. Therefore, the PCC unit of the proposed architecture contains adders and multipliers. Because the number of multipliers grows with the vector dimension <italic>m</italic>, a direct implementation of <xref ref-type="disp-formula" rid="FD1">Equation (1)</xref> may consume a large amount of hardware resources when <italic>m</italic> becomes large. As in the SWU unit, block-based computation is used to reduce the area costs. Based on <xref ref-type="disp-formula" rid="FD9">Equations (9)</xref> and <xref ref-type="disp-formula" rid="FD11">(11)</xref>, <xref ref-type="disp-formula" rid="FD1">Equation (1)</xref> can be rewritten as
<disp-formula id="FD12">
<label>(12)</label>
<mml:math id="mm13" display="block">
<mml:semantics id="sm13">
<mml:mrow>
<mml:msub>
<mml:mi>y</mml:mi>
<mml:mi>j</mml:mi></mml:msub>
<mml:mo stretchy="false">(</mml:mo>
<mml:mi>n</mml:mi>
<mml:mo stretchy="false">)</mml:mo>
<mml:mo>=</mml:mo>
<mml:munderover>
<mml:mo>∑</mml:mo>
<mml:mrow>
<mml:mi>k</mml:mi>
<mml:mo>=</mml:mo>
<mml:mn>1</mml:mn></mml:mrow>
<mml:mi>b</mml:mi></mml:munderover>
<mml:mrow>
<mml:munderover>
<mml:mo>∑</mml:mo>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mo>=</mml:mo>
<mml:mn>1</mml:mn></mml:mrow>
<mml:mi>q</mml:mi></mml:munderover>
<mml:mrow>
<mml:msub>
<mml:mi>w</mml:mi>
<mml:mrow>
<mml:mi>j</mml:mi>
<mml:mo>,</mml:mo>
<mml:mo stretchy="false">(</mml:mo>
<mml:mi>k</mml:mi>
<mml:mo>−</mml:mo>
<mml:mn>1</mml:mn>
<mml:mo stretchy="false">)</mml:mo>
<mml:mi>q</mml:mi>
<mml:mo>+</mml:mo>
<mml:mi>i</mml:mi></mml:mrow></mml:msub>
<mml:mo stretchy="false">(</mml:mo>
<mml:mi>n</mml:mi>
<mml:mo stretchy="false">)</mml:mo>
<mml:msub>
<mml:mi>x</mml:mi>
<mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:mi>k</mml:mi>
<mml:mo>−</mml:mo>
<mml:mn>1</mml:mn>
<mml:mo stretchy="false">)</mml:mo>
<mml:mi>q</mml:mi>
<mml:mo>+</mml:mo>
<mml:mi>i</mml:mi></mml:mrow></mml:msub>
<mml:mo stretchy="false">(</mml:mo>
<mml:mi>n</mml:mi>
<mml:mo stretchy="false">)</mml:mo>
<mml:mo>=</mml:mo>
<mml:munderover>
<mml:mo>∑</mml:mo>
<mml:mrow>
<mml:mi>k</mml:mi>
<mml:mo>=</mml:mo>
<mml:mn>1</mml:mn></mml:mrow>
<mml:mi>b</mml:mi></mml:munderover>
<mml:mrow>
<mml:msubsup>
<mml:mrow>
<mml:mover accent="true">
<mml:mtext>w</mml:mtext>
<mml:mo>^</mml:mo></mml:mover></mml:mrow>
<mml:mrow>
<mml:mi>j</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi>k</mml:mi></mml:mrow>
<mml:mi>T</mml:mi></mml:msubsup>
<mml:mo stretchy="false">(</mml:mo>
<mml:mi>n</mml:mi>
<mml:mo stretchy="false">)</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mover accent="true">
<mml:mtext>z</mml:mtext>
<mml:mo>^</mml:mo></mml:mover></mml:mrow>
<mml:mrow>
<mml:mn>0</mml:mn>
<mml:mo>,</mml:mo>
<mml:mi>k</mml:mi></mml:mrow></mml:msub>
<mml:mo stretchy="false">(</mml:mo>
<mml:mi>n</mml:mi>
<mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mrow></mml:mrow></mml:mrow></mml:semantics></mml:math></disp-formula></p>
<p>The implementation of <xref ref-type="disp-formula" rid="FD12">Equation (12)</xref> needs only <italic>q</italic> multipliers, a <italic>q</italic>-input adder, an accumulator, and a <italic>p</italic>-entry buffer, as shown in <xref ref-type="fig" rid="f6-sensors-12-06244">Figure 6</xref>. The multipliers and the <italic>q</italic>-input adder are organized as an <italic>s</italic>-stage pipeline to enhance the throughput of the circuit.</p>
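<p>The data flow implied by <xref ref-type="disp-formula" rid="FD12">Equation (12)</xref> can be modelled in a few lines of software. The following is a minimal sketch under simplifying assumptions: it ignores the <italic>s</italic>-stage pipelining and the fixed-point format, and the function name <monospace>pcc_blockwise</monospace> is hypothetical. It only illustrates that <italic>q</italic> products per block and a single accumulator suffice for any <italic>m</italic> that is a multiple of <italic>q</italic>.</p>
<preformat>
```python
def pcc_blockwise(w_j, x, q):
    """Accumulate y_j(n) one q-element block at a time, following Equation (12)."""
    m = len(x)
    if len(w_j) != m or m % q != 0:
        raise ValueError("w_j and x must share a length that is a multiple of q")
    b = m // q                       # number of blocks, one per clock cycle
    acc = 0                          # models the accumulator register
    for k in range(b):               # block k covers elements k*q .. k*q + q - 1
        w_blk = w_j[k * q:(k + 1) * q]
        z_blk = x[k * q:(k + 1) * q]
        products = [w * z for w, z in zip(w_blk, z_blk)]   # the q multipliers
        acc += sum(products)         # q-input adder feeding the accumulator
    return acc
```
</preformat>
<p>For any valid input, the result equals the direct inner product of the two vectors; only the order in which the partial sums are formed differs.</p>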
<p>The blocks ŵ<italic><sub>j,k</sub></italic>(<italic>n</italic>) and ẑ<sub>0</sub><italic><sub>,k</sub></italic>(<italic>n</italic>) are the inputs to the PCC unit. <xref ref-type="fig" rid="f6-sensors-12-06244">Figure 6</xref> also shows the operation of the PCC unit when the input vectors are ŵ<italic><sub>j,</sub></italic><sub>1</sub>(<italic>n</italic>) and ẑ<sub>0,1</sub>(<italic>n</italic>). Note that the output of the accumulator in the circuit becomes <italic>y<sub>j</sub></italic>(<italic>n</italic>) only after all the blocks ŵ<italic><sub>j,k</sub></italic>(<italic>n</italic>) and ẑ<sub>0,</sub><italic><sub>k</sub></italic>(<italic>n</italic>), <italic>k =</italic> 1,…,<italic>b</italic>, have been fetched from the memory unit. The computation of each <italic>y<sub>j</sub></italic>(<italic>n</italic>) therefore takes <italic>b</italic> + <italic>s</italic> cycles. After the computation of <italic>y<sub>j</sub></italic>(<italic>n</italic>) is completed, <italic>y<sub>j</sub></italic>(<italic>n</italic>) is stored in the <italic>j</italic>-th entry of the buffer for the subsequent computation of w<italic><sub>j</sub></italic>(<italic>n</italic> + 1) in the SWU unit.</p></sec>
<sec>
<label>3.3.</label>
<title>Memory Unit</title>
<p>The memory unit contains three buffers: Buffer A, Buffer B and Buffer C. Buffer A fetches the training vector x(<italic>n</italic>) from the main memory and stores it. Buffer B contains z<italic><sub>j</sub></italic>(<italic>n</italic>) for the computation in the PCC and SWU units. The synaptic weight vectors w<italic><sub>j</sub></italic>(<italic>n</italic>) are stored in Buffer C. All the buffers are shift registers.</p>
<p>To fetch the training vector x(<italic>n</italic>) from main memory, the <italic>m</italic> elements in the training vector are interleaved and separated into <italic>q</italic> segments. Each segment contains <italic>b</italic> elements. Therefore, Buffer A is a <italic>q</italic>-stage shift register, where each stage contains <italic>b</italic> cells, as shown in <xref ref-type="fig" rid="f7-sensors-12-06244">Figure 7</xref>. Once all <italic>q</italic> segments have been received, they are copied to Buffer B as z<sub>0</sub>(<italic>n</italic>).</p>
<p>The architecture of Buffer B is depicted in <xref ref-type="fig" rid="f8-sensors-12-06244">Figure 8</xref>. It holds the values of z<italic><sub>j</sub></italic>(<italic>n</italic>) for the computation in the PCC and SWU units. The data in Buffer B are initialized from Buffer A. That is, the initial content of Buffer B is x(<italic>n</italic>) (<italic>i.e.</italic>, z<sub>0</sub>(<italic>n</italic>)). As shown in <xref ref-type="fig" rid="f9-sensors-12-06244">Figure 9</xref>, Buffer B then provides the <italic>b</italic> blocks ẑ<sub>0</sub><italic><sub>,k</sub></italic>(<italic>n</italic>), <italic>k</italic> = 1,…,<italic>b</italic>, sequentially to the PCC unit for the computation of <italic>y<sub>j</sub></italic>(<italic>n</italic>). Because z<sub>0</sub>(<italic>n</italic>) is used for the operations in both the PCC and SWU units, all the data output to the PCC unit are also rotated back to Buffer B.</p>
<p>After the PCC computation is completed, Buffer B delivers data to the SWU unit. Starting from z<sub>0</sub>(<italic>n</italic>), Buffer B provides z<italic><sub>j</sub></italic>(<italic>n</italic>) to the SWU unit, and then receives z<sub><italic>j</italic>+1</sub>(<italic>n</italic>) from the SWU unit for <italic>j</italic> = 0,…, <italic>p</italic> − 1. The delivery of z<italic><sub>j</sub></italic>(<italic>n</italic>) and the collection of z<sub><italic>j</italic>+1</sub>(<italic>n</italic>) are carried out on a block-by-block basis, as depicted in <xref ref-type="fig" rid="f10-sensors-12-06244">Figure 10</xref>.</p>
<p>Buffer C contains the synaptic weight vectors w<italic><sub>j</sub></italic>(<italic>n</italic>), <italic>j</italic> = 1,…,<italic>p</italic>. In addition to providing and storing data for the computation in the PCC and SWU units, it also holds the final results after GHA training. <xref ref-type="fig" rid="f11-sensors-12-06244">Figure 11</xref> shows the architecture of Buffer C. As in Buffer B, each synaptic weight vector w<italic><sub>j</sub></italic>(<italic>n</italic>) is divided into <italic>b</italic> blocks. They are delivered to the PCC unit sequentially for the computation of <italic>y<sub>j</sub></italic>(<italic>n</italic>). Moreover, since w<italic><sub>j</sub></italic>(<italic>n</italic>) is also needed for the computation of w<italic><sub>j</sub></italic>(<italic>n</italic> + 1) in the SWU unit, the <italic>b</italic> blocks delivered to the PCC unit are also rotated back to Buffer C. <xref ref-type="fig" rid="f12-sensors-12-06244">Figure 12</xref> shows the operation of Buffer C for the computation in the PCC unit.</p>
<p>To support the computation in the SWU unit, Buffer C delivers w<italic><sub>j</sub></italic>(<italic>n</italic>) to the SWU unit, and then receives w<italic><sub>j</sub></italic>(<italic>n</italic> + 1) from the unit. The delivery of w<italic><sub>j</sub></italic>(<italic>n</italic>) and the collection of w<italic><sub>j</sub></italic>(<italic>n</italic> + 1) are also carried out on a block-by-block basis, as depicted in <xref ref-type="fig" rid="f13-sensors-12-06244">Figure 13</xref>.</p>
<p>Based on the operations of the memory unit, <xref ref-type="fig" rid="f14-sensors-12-06244">Figure 14</xref> shows the timing diagram of the proposed architecture. It can be observed from the figure that Buffer A operates concurrently with Buffers B and C. That is, while the proposed architecture is fetching the training vector x(<italic>n</italic> + 1) into Buffer A, it is also computing <italic>y<sub>j</sub></italic>(<italic>n</italic>) and w<italic><sub>j</sub></italic>(<italic>n</italic> + 1) based on x(<italic>n</italic>) and w<italic><sub>j</sub></italic>(<italic>n</italic>). Fetching training vectors may be a time-consuming process as the vector dimension grows. Therefore, performing training vector fetching and weight vector computation in parallel is beneficial for increasing the GHA training speed.</p></sec>
<sec>
<label>3.4.</label>
<title>SOPC-Based GHA Training System</title>
<p>The proposed architecture is used as custom user logic in a SOPC system consisting of a softcore NIOS CPU [<xref ref-type="bibr" rid="b22-sensors-12-06244">22</xref>], a DMA controller and SDRAM, as depicted in <xref ref-type="fig" rid="f15-sensors-12-06244">Figure 15</xref>. All training vectors are stored in the SDRAM and then transported to the proposed circuit via the Avalon bus. The training data are delivered by DMA so that the memory access overhead is minimized. The softcore NIOS CPU runs simple software that supports the proposed circuit during GHA training. The software is used only for coordinating the different components of the SOPC platform. It does not involve GHA computations. Once the delivery of the training vectors is completed, the softcore CPU retrieves the training results from the proposed architecture for subsequent classification operations.</p>
<p><xref ref-type="fig" rid="f16-sensors-12-06244">Figure 16</xref> depicts the interface of the proposed architecture to the SOPC system. The interface consists of an interface buffer for transferring data between the proposed GHA architecture and the SOPC system. The proposed GHA architecture contains a simple controller for accessing the interface. <xref ref-type="fig" rid="f17-sensors-12-06244">Figure 17</xref> depicts the operations of the controller. As shown in <xref ref-type="fig" rid="f17-sensors-12-06244">Figure 17</xref>, the proposed circuit fetches the training vectors from the interface buffer to Buffer A for subsequent processing. In addition, after the completion of training, the synaptic weight vectors in Buffer C are delivered to the interface buffer so that they can be accessed by the NIOS CPU.</p></sec></sec>
<sec sec-type="methods|results">
<label>4.</label>
<title>Performance Analysis and Experimental Results</title>
<p>The area complexities and latency are the major performance measures considered in this study. Because adders, multipliers and registers are the basic building blocks of the GHA architecture, the area complexities are separated into three categories: the number of adders, the number of multipliers and the number of registers. Given the current synaptic weight vectors w<italic><sub>j</sub></italic>(<italic>n</italic>), <italic>j</italic> = 1,…,<italic>p</italic>, the latency of the proposed GHA architecture is defined as the time required to produce the new synaptic weight vectors w<italic><sub>j</sub></italic>(<italic>n</italic> + 1), <italic>j</italic> = 1,…,<italic>p</italic>.</p>
<p><xref ref-type="table" rid="t1-sensors-12-06244">Table 1</xref> shows the area complexities and latency of various architectures for GHA training. It can be observed from the table that the numbers of adders and multipliers of the proposed architecture are independent of the vector dimension <italic>m</italic> and the number of principal components <italic>p</italic>. By contrast, the area costs of [<xref ref-type="bibr" rid="b18-sensors-12-06244">18</xref>] grow with both <italic>m</italic> and <italic>p</italic>. We can also see from the table that the latency of [<xref ref-type="bibr" rid="b19-sensors-12-06244">19</xref>] increases with both <italic>m</italic> and <italic>p</italic>. Based on the timing diagram shown in <xref ref-type="fig" rid="f14-sensors-12-06244">Figure 14</xref>, the latency of the proposed architecture is <italic>max</italic>(<italic>q</italic>, 2<italic>bp</italic> + <italic>s</italic>). Therefore, it is independent of vector dimension <italic>m</italic>. The proposed architecture is therefore well suited for applications requiring a large vector dimension <italic>m</italic>.</p>
<p>Next we consider the physical implementation of the proposed architecture. The design platform is Altera Quartus II with SOPC Builder [<xref ref-type="bibr" rid="b23-sensors-12-06244">23</xref>] and NIOS II IDE. <xref ref-type="table" rid="t2-sensors-12-06244">Table 2</xref> shows the hardware resource consumption of the proposed architecture for vector dimensions <italic>m</italic> = 16 × 16 and <italic>m</italic> = 32 × 32. The hardware resource utilization of the entire SOPC systems is given in <xref ref-type="table" rid="t3-sensors-12-06244">Table 3</xref>. In order to maintain a low area cost, we use a fixed-point format to represent data. Each value is stored as a signed 8-bit number. The target FPGA device is Altera Cyclone IV EP4CGX150DF31C7. The number of modules <italic>q</italic> is 64 for all the implementations shown in the tables.</p>
<p>Three different area resources are considered in the tables: Logic Elements (LEs), embedded memory bits, and embedded multipliers. The LEs are used for the implementation of adders, multipliers and registers in the proposed GHA architecture. Both the LEs and embedded memory bits are also used for the implementation of the NIOS CPU of the SOPC system. The embedded multipliers are used for the implementation of the multipliers of the proposed GHA architecture.</p>
<p>It can be observed from <xref ref-type="table" rid="t2-sensors-12-06244">Tables 2</xref> and <xref ref-type="table" rid="t3-sensors-12-06244">3</xref> that the consumption of embedded multipliers by the proposed architecture is independent of the vector dimension <italic>m</italic> and the number of principal components <italic>p</italic>. Because the embedded multipliers are used only for the implementation of the multipliers in the proposed architecture, their number depends only on <italic>q</italic>. In the experiment, all the implementations in <xref ref-type="table" rid="t2-sensors-12-06244">Tables 2</xref> and <xref ref-type="table" rid="t3-sensors-12-06244">3</xref> have the same <italic>q</italic>. Therefore, all the implementations utilize the same number of embedded multipliers.</p>
<p>Because the embedded memory bits are used mainly for the realization of the NIOS CPU, the consumption of embedded memory bits is also independent of <italic>m</italic> and <italic>p</italic>, as shown in <xref ref-type="table" rid="t2-sensors-12-06244">Tables 2</xref> and <xref ref-type="table" rid="t3-sensors-12-06244">3</xref>. It can be observed from the tables that the consumption of LEs grows with <italic>m</italic> and <italic>p</italic>. This is not surprising, because the LEs are used to implement the registers, and the number of registers increases with <italic>m</italic> and <italic>p</italic>, as shown in <xref ref-type="table" rid="t1-sensors-12-06244">Table 1</xref>. Therefore, the numerical results shown in <xref ref-type="table" rid="t2-sensors-12-06244">Tables 2</xref> and <xref ref-type="table" rid="t3-sensors-12-06244">3</xref> are consistent with the analytical results in <xref ref-type="table" rid="t1-sensors-12-06244">Table 1</xref>.</p>
<p><xref ref-type="fig" rid="f18-sensors-12-06244">Figures 18</xref> and <xref ref-type="fig" rid="f19-sensors-12-06244">19</xref> show the Classification Success Rate (CSR) distribution of the proposed architecture for the textures shown in <xref ref-type="fig" rid="f20-sensors-12-06244">Figures 20</xref> and <xref ref-type="fig" rid="f21-sensors-12-06244">21</xref>, respectively. The CSR is defined as the number of test vectors that are successfully classified divided by the total number of test vectors. The number of principal components is <italic>p</italic> = 4. The vector dimensions are <italic>m</italic> = 16 × 16 and 32 × 32. The distribution for each vector dimension is based on 20 independent GHA training processes. The CSR distribution of the architecture presented in [<xref ref-type="bibr" rid="b18-sensors-12-06244">18</xref>] with the same <italic>p</italic> is also included for comparison. The vector dimension for [<xref ref-type="bibr" rid="b18-sensors-12-06244">18</xref>] is <italic>m</italic> = 4 × 4.</p>
<p>The size of each texture in <xref ref-type="fig" rid="f20-sensors-12-06244">Figures 20</xref> and <xref ref-type="fig" rid="f21-sensors-12-06244">21</xref> is 576 × 576. In the experiment, the Principal Component based <italic>k</italic> Nearest Neighbor (PC-<italic>k</italic>NN) rule is adopted for texture classification. Two steps are involved in the PC-<italic>k</italic>NN rule. In the first step, the GHA is applied to the input vectors to transform <italic>m</italic>-dimensional data into <italic>p</italic> principal components. The synaptic weight vectors obtained after the convergence of GHA training are used to form the linear transformation matrix. In the second step, the <italic>k</italic>NN method is applied in the principal subspace for texture classification.</p>
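<p>The two steps of the PC-<italic>k</italic>NN rule and the CSR definition can be illustrated with a short software sketch. It is not the classification code used in the experiments: the squared Euclidean distance, the majority vote and all function names (<monospace>project</monospace>, <monospace>knn_classify</monospace>, <monospace>csr</monospace>) are assumptions made here for illustration, and the weight vectors are taken as already trained.</p>
<preformat>
```python
from collections import Counter


def project(weights, x):
    """Step 1: map an m-dimensional vector to p components, one per weight vector."""
    return [sum(w_i * x_i for w_i, x_i in zip(w, x)) for w in weights]


def knn_classify(train_feats, train_labels, feat, k):
    """Step 2: majority vote among the k nearest training features.

    Squared Euclidean distance is an assumption of this sketch.
    """
    scored = sorted(
        ((sum((a - b) ** 2 for a, b in zip(f, feat)), label)
         for f, label in zip(train_feats, train_labels)),
        key=lambda pair: pair[0],
    )
    votes = Counter(label for _, label in scored[:k])
    return votes.most_common(1)[0][0]


def csr(weights, train_x, train_labels, test_x, test_labels, k):
    """Correctly classified test vectors divided by the total number of test vectors."""
    train_feats = [project(weights, x) for x in train_x]
    correct = sum(
        knn_classify(train_feats, train_labels, project(weights, x), k) == y
        for x, y in zip(test_x, test_labels)
    )
    return correct / len(test_x)
```
</preformat>
<p>Because the nearest-neighbor search runs on <italic>p</italic>-element features rather than on the original <italic>m</italic>-element vectors, its cost per comparison does not grow with the vector dimension.</p>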
<p>It can be observed from <xref ref-type="fig" rid="f18-sensors-12-06244">Figures 18</xref> and <xref ref-type="fig" rid="f19-sensors-12-06244">19</xref> that the proposed architecture attains a higher CSR. This is because the vector dimensions of the proposed architecture are higher than those in [<xref ref-type="bibr" rid="b18-sensors-12-06244">18</xref>]. The spatial information of the textures can therefore be exploited effectively. The proposed architecture is able to implement hardware GHA training with vector dimensions up to <italic>m</italic> = 32 × 32. The hardware realization for <italic>m</italic> = 32 × 32 is possible because the area costs of the SWU and PCC units in the proposed architecture are independent of the vector dimension. By contrast, the area costs of the SWU and PCC units in [<xref ref-type="bibr" rid="b18-sensors-12-06244">18</xref>] grow with the vector dimension. Therefore, only a smaller vector dimension (<italic>i.e., m</italic> = 4 × 4) can be implemented.</p>
<p>Although the proposed architecture is based on a signed 8-bit fixed-point format, the degradation in CSR is small compared with the GHA without truncation. <xref ref-type="fig" rid="f22-sensors-12-06244">Figure 22</xref> reveals the truncation effects of the proposed architecture. The GHA without truncation is implemented in software with a floating-point format. The training images for this experiment are shown in <xref ref-type="fig" rid="f20-sensors-12-06244">Figure 20</xref>. The vector dimension is 32 × 32. The distribution for each format is based on 20 independent GHA training processes. It can be observed from <xref ref-type="fig" rid="f22-sensors-12-06244">Figure 22</xref> that the fixed-point format yields only a slight decrease in CSR. In fact, the average CSR degradation is only 3.44 percentage points (from an average CSR of 95.53% for the floating-point format to 92.09% for the fixed-point format).</p>
<p>Another advantage of the proposed architecture is its superior computational capacity for GHA training. <xref ref-type="fig" rid="f23-sensors-12-06244">Figure 23</xref> shows the CPU time of the NIOS-based SOPC system using the proposed architecture as a hardware accelerator for various numbers of training iterations with <italic>m</italic> = 16 × 16 and <italic>p</italic> = 7. The NIOS CPU clock rate in the system is 50 MHz. The target FPGA for the implementation is Cyclone III EP3C120F780C8. The CPU times of the software counterparts running on the general purpose 1.6 GHz Intel i5 and 2.8 GHz Intel i7 processors are also depicted in <xref ref-type="fig" rid="f23-sensors-12-06244">Figure 23</xref> for comparison. The software implementations are multithreaded to take advantage of all the cores in the processors. There are 16 threads in the code: 8 threads for synaptic weight updating, and 8 threads for the principal component computation and other tasks. An optimizing compiler (offered by Visual Studio) is used to further enhance the computational speed. It can be clearly observed from <xref ref-type="fig" rid="f23-sensors-12-06244">Figure 23</xref> that the proposed architecture attains a high speedup over its software counterparts. In particular, when the number of training iterations reaches 1000, the CPU time of the proposed SOPC system is 733.14 ms. By contrast, the CPU time of the Intel i7 is 10,125.37 ms. The speedup of the proposed architecture over the software counterpart is therefore 13.81.</p>
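<p>The quoted speedup is simply the ratio of the two CPU times at 1000 training iterations. The short check below reproduces it, reading the Intel i7 figure as 10,125.37 ms, which is the value consistent with the stated speedup; the function name <monospace>speedup</monospace> is hypothetical.</p>
<preformat>
```python
def speedup(software_ms, hardware_ms):
    """Ratio of software CPU time to accelerator CPU time."""
    return software_ms / hardware_ms


# CPU times at 1000 training iterations, as reported in the text.
sopc_ms = 733.14
i7_ms = 10125.37
print(round(speedup(i7_ms, sopc_ms), 2))  # prints 13.81
```
</preformat>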
<p>The proposed architecture has superior speed performance over its software counterparts because there are limitations on exploiting thread-level parallelism. The GHA is an incremental training algorithm. Therefore, it is difficult to exploit parallelism among the computations for different training vectors. The inherent data dependency among different GHA stages (e.g., between principal component computation and weight vector updating) may slow down the computation due to costly data forwarding via shared memory. Moreover, the inputs (<italic>i.e.</italic>, x(<italic>n</italic>) and w<sub><italic>j</italic></sub>(<italic>n</italic>), <italic>j</italic> = 1,…,<italic>p</italic>) and outputs (<italic>i.e.</italic>, y(<italic>n</italic>), w<italic><sub>j</sub></italic>(<italic>n</italic> + 1), <italic>j</italic> = 1,…,<italic>p</italic>) of the algorithm are all vectors with large dimensions. The large number of memory accesses required by GHA is another limiting factor for the performance of software implementations. By contrast, the proposed architecture is able to perform data forwarding and memory accesses efficiently. The use of Buffers A, B and C allows training vector fetching and weight vector computation to proceed in parallel. The latency of memory access can then be concealed. Moreover, Buffers B and C are also designed for fast data forwarding between principal component computation and weight vector updating without complicated memory management or external memory accesses.</p>
<p>In addition to having superior computational speed, the proposed architecture consumes less power. <xref ref-type="table" rid="t4-sensors-12-06244">Table 4</xref> shows the power consumption of various GHA implementations. For the power estimation of the GHA software implementations, the tool Joulemeter (developed by Microsoft Research) [<xref ref-type="bibr" rid="b24-sensors-12-06244">24</xref>] is used. The tool is able to estimate the power consumed by the CPU for a specific application. The power consumption of other parts of a computer, such as the main memory and monitor, can therefore be excluded from the comparisons. The power consumed by the proposed architecture is estimated by the PowerPlay Power Analyzer Tool [<xref ref-type="bibr" rid="b25-sensors-12-06244">25</xref>] provided by Altera. From <xref ref-type="table" rid="t4-sensors-12-06244">Table 4</xref>, it can be observed that the power consumption of the proposed architecture is only 0.4% of that of the Intel i7 processor for GHA training (<italic>i.e.</italic>, 0.129 W <italic>versus</italic> 31.656 W). Compared with the low-power multicore Intel i5 processor for laptop computers, the proposed architecture also has significantly lower power dissipation (<italic>i.e.</italic>, 0.129 W <italic>versus</italic> 1.292 W).</p>
<p><xref ref-type="table" rid="t5-sensors-12-06244">Table 5</xref> compares the computation speed of various GHA architectures implemented on FPGAs. As in <xref ref-type="fig" rid="f23-sensors-12-06244">Figure 23</xref>, the computation time of the proposed architecture is measured as the CPU time of the NIOS processor using the proposed architecture as the hardware accelerator. The clock rate of the NIOS CPU in the system is 100 MHz. The vector dimension and the number of principal components associated with the proposed architecture are <italic>m</italic> = 16 × 16 and <italic>p</italic> = 16, respectively. The computation times of the architectures in [<xref ref-type="bibr" rid="b18-sensors-12-06244">18</xref>,<xref ref-type="bibr" rid="b19-sensors-12-06244">19</xref>] with different <italic>m</italic> and/or <italic>p</italic> values are also included in the table.</p>
<p>Note that direct comparisons of these architectures may be difficult because their speeds are measured on different FPGA devices with different <italic>m, p</italic> and/or clock rates. To show the superiority of the proposed architecture, the comparisons are based on the same training size (<italic>i.e.</italic>, number of training vectors per iteration) and number of iterations. With a larger vector dimension (<italic>i.e.</italic>, 16 × 16 <italic>versus</italic> 16 × 8), a slower clock rate (<italic>i.e.</italic>, 100 MHz <italic>versus</italic> 136.243 MHz), and the same number of principal components (<italic>i.e., p</italic> = 16), it can be observed from <xref ref-type="table" rid="t5-sensors-12-06244">Table 5</xref> that the proposed architecture is still faster than the architecture in [<xref ref-type="bibr" rid="b19-sensors-12-06244">19</xref>]. Although the architecture in [<xref ref-type="bibr" rid="b18-sensors-12-06244">18</xref>] has the shortest computation time, it is suitable only for a small vector dimension (<italic>i.e., m</italic> = 4 × 4) and a small number of principal components (<italic>i.e., p</italic> = 4). All these facts demonstrate the effectiveness of the proposed architecture.</p></sec>
<sec>
<label>5.</label>
<title>Concluding Remarks</title>
<p>Experimental results reveal that the proposed GHA architecture has superior speed performance over its software counterparts and other GHA architectures. With a lower clock rate and a higher vector dimension, the proposed architecture is still faster than the architecture in [<xref ref-type="bibr" rid="b19-sensors-12-06244">19</xref>]. In addition, the architecture is able to attain a higher CSR for texture classification compared with other GHA architectures. In fact, all the CSRs are above 90% for all the experiments considered in this paper. The proposed architecture also has low area costs for fast PCA with vector dimensions up to <italic>m</italic> = 32 × 32. The utilization of memory bits and embedded multipliers in the FPGA implementation is independent of the vector dimension and the number of principal components. The proposed architecture is therefore an effective alternative for on-chip learning applications requiring low area costs, a high classification success rate and high-speed computation.</p></sec></body>
<back>
<ref-list>
<title>References</title>
<ref id="b1-sensors-12-06244"><label>1.</label><citation citation-type="book"><person-group person-group-type="author"><name><surname>Jolliffe</surname><given-names>I.T.</given-names></name></person-group><source>Principal Component Analysis</source><edition>2nd ed.</edition><publisher-name>Springer</publisher-name><publisher-loc>Berlin, Heidelberg, Germany</publisher-loc><year>2002</year></citation></ref>
<ref id="b2-sensors-12-06244"><label>2.</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Dagher</surname><given-names>I.</given-names></name><name><surname>Nachar</surname><given-names>R.</given-names></name></person-group><article-title>Face recognition using (incremental) IPCA-ICA algorithm</article-title><source>IEEE Trans. Pattern Anal. Mach. Intell.</source><year>2006</year><volume>28</volume><fpage>996</fpage><lpage>1000</lpage><pub-id pub-id-type="doi">10.1109/TPAMI.2006.118</pub-id><pub-id pub-id-type="pmid">16724592</pub-id></citation></ref>
<ref id="b3-sensors-12-06244"><label>3.</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Kim</surname><given-names>K.</given-names></name><name><surname>Franz</surname><given-names>M.O.</given-names></name><name><surname>Schölkopf</surname><given-names>B.</given-names></name></person-group><article-title>Iterative kernel principal component analysis for image modeling</article-title><source>IEEE Trans. Pattern Anal. Mach. Intell.</source><year>2005</year><volume>27</volume><fpage>1351</fpage><lpage>1366</lpage><pub-id pub-id-type="doi">10.1109/TPAMI.2005.181</pub-id><pub-id pub-id-type="pmid">16173181</pub-id></citation></ref>
<ref id="b4-sensors-12-06244"><label>4.</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Roweis</surname><given-names>S.</given-names></name></person-group><article-title>EM algorithms for PCA and SPCA</article-title><source>Adv. Neural Inf. Process. Syst.</source><year>1998</year><volume>10</volume><fpage>626</fpage><lpage>632</lpage></citation></ref>
<ref id="b5-sensors-12-06244"><label>5.</label><citation citation-type="confproc"><person-group person-group-type="author"><name><surname>Sajid</surname><given-names>I.</given-names></name><name><surname>Ahmed</surname><given-names>M.M.</given-names></name><name><surname>Taj</surname><given-names>I.</given-names></name></person-group><article-title>Design and Implementation of a Face Recognition System Using Fast PCA</article-title><conf-name>Proceedings of the IEEE International Symposium on Computer Science and its Applications</conf-name><conf-loc>Hobart, Australia</conf-loc><conf-date>13–15 October 2008</conf-date><fpage>126</fpage><lpage>130</lpage></citation></ref>
<ref id="b6-sensors-12-06244"><label>6.</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Sharma</surname><given-names>A.</given-names></name><name><surname>Paliwal</surname><given-names>K.K.</given-names></name></person-group><article-title>Fast principal component analysis using fixed-point algorithm</article-title><source>Pattern Recogn. Lett.</source><year>2007</year><volume>28</volume><fpage>1151</fpage><lpage>1155</lpage><pub-id pub-id-type="doi">10.1016/j.patrec.2007.01.012</pub-id></citation></ref>
<ref id="b7-sensors-12-06244"><label>7.</label><citation citation-type="confproc"><person-group person-group-type="author"><name><surname>Boonkumklao</surname><given-names>W.</given-names></name><name><surname>Miyanaga</surname><given-names>Y.</given-names></name><name><surname>Dejhan</surname><given-names>K.</given-names></name></person-group><article-title>Flexible PCA Architecture Realized on FPGA</article-title><conf-name>Proceedings of the International Symposium on Communications and Information Technologies</conf-name><conf-loc>Chiang Mai, Thailand</conf-loc><conf-date>14–16 November 2001</conf-date><fpage>590</fpage><lpage>593</lpage></citation></ref>
<ref id="b8-sensors-12-06244"><label>8.</label><citation citation-type="confproc"><person-group person-group-type="author"><name><surname>Chen</surname><given-names>T.-C.</given-names></name><name><surname>Liu</surname><given-names>W.</given-names></name><name><surname>Chen</surname><given-names>L.-G.</given-names></name></person-group><article-title>VLSI Architecture of Leading Eigenvector Generation for On-Chip Principal Component Analysis Spike Sorting System</article-title><conf-name>Proceedings of the 30th Annual International Conference of the IEEE Engineering in Medicine and Biology Society</conf-name><conf-loc>Vancouver, BC, Canada</conf-loc><conf-date>20–24 August 2008</conf-date><fpage>3192</fpage><lpage>3195</lpage></citation></ref>
<ref id="b9-sensors-12-06244"><label>9.</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Chen</surname><given-names>D.</given-names></name><name><surname>Han</surname><given-names>J.-Q.</given-names></name></person-group><article-title>An FPGA-based face recognition using combined 5/3 DWT with PCA methods</article-title><source>J. Commun. Comput.</source><year>2009</year><volume>6</volume><fpage>1</fpage><lpage>8</lpage></citation></ref>
<ref id="b10-sensors-12-06244"><label>10.</label><citation citation-type="confproc"><person-group person-group-type="author"><name><surname>Ngo</surname><given-names>H.T.</given-names></name><name><surname>Rajkiran</surname><given-names>G.</given-names></name><name><surname>Asari</surname><given-names>V.K.</given-names></name></person-group><article-title>A Flexible and Efficient Hardware Architecture for Real-Time Face Recognition Based on Eigenface</article-title><conf-name>Proceedings of the IEEE Computer Society Annual Symposium on VLSI</conf-name><conf-loc>Tampa, FL, USA</conf-loc><conf-date>11–12 May 2005</conf-date><fpage>280</fpage><lpage>281</lpage></citation></ref>
<ref id="b11-sensors-12-06244"><label>11.</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Gottumukkal</surname><given-names>R.</given-names></name><name><surname>Ngo</surname><given-names>H.T.</given-names></name><name><surname>Asari</surname><given-names>V.K.</given-names></name></person-group><article-title>Multi-lane architecture for eigenface based real-time face recognition</article-title><source>Microprocess. Microsyst.</source><year>2006</year><volume>30</volume><fpage>216</fpage><lpage>224</lpage><pub-id pub-id-type="doi">10.1016/j.micpro.2005.07.003</pub-id></citation></ref>
<ref id="b12-sensors-12-06244"><label>12.</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Pavan Kumar</surname><given-names>A.</given-names></name><name><surname>Kamakoti</surname><given-names>V.</given-names></name><name><surname>Das</surname><given-names>S.</given-names></name></person-group><article-title>System-on-programmable-chip implementation for on-line face recognition</article-title><source>Pattern Recogn. Lett.</source><year>2007</year><volume>28</volume><fpage>342</fpage><lpage>349</lpage><pub-id pub-id-type="doi">10.1016/j.patrec.2006.04.006</pub-id></citation></ref>
<ref id="b13-sensors-12-06244"><label>13.</label><citation citation-type="book"><person-group person-group-type="author"><name><surname>Haykin</surname><given-names>S.</given-names></name></person-group><source>Neural Networks and Learning Machines</source><edition>3rd ed.</edition><publisher-name>Pearson</publisher-name><publisher-loc>Upper Saddle River, NJ, USA</publisher-loc><year>2009</year></citation></ref>
<ref id="b14-sensors-12-06244"><label>14.</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Sanger</surname><given-names>T.D.</given-names></name></person-group><article-title>Optimal unsupervised learning in a single-layer linear feedforward neural network</article-title><source>Neural Netw.</source><year>1989</year><volume>2</volume><fpage>459</fpage><lpage>473</lpage></citation></ref>
<ref id="b15-sensors-12-06244"><label>15.</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Blake</surname><given-names>G.</given-names></name><name><surname>Dreslinski</surname><given-names>R.G.</given-names></name><name><surname>Mudge</surname><given-names>T.</given-names></name></person-group><article-title>A survey of multicore processors</article-title><source>IEEE Signal Process. Mag.</source><year>2009</year><volume>26</volume><fpage>26</fpage><lpage>37</lpage></citation></ref>
<ref id="b16-sensors-12-06244"><label>16.</label><citation citation-type="book"><person-group person-group-type="author"><name><surname>Carvajal</surname><given-names>G.</given-names></name><name><surname>Valenzuela</surname><given-names>W.</given-names></name><name><surname>Figueroa</surname><given-names>M.</given-names></name></person-group><article-title>Subspace-Based Face Recognition in Analog VLSI</article-title><source>Advances in Neural Information Processing Systems</source><publisher-name>MIT Press</publisher-name><publisher-loc>Cambridge, MA, USA</publisher-loc><year>2008</year><fpage>225</fpage><lpage>232</lpage></citation></ref>
<ref id="b17-sensors-12-06244"><label>17.</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Carvajal</surname><given-names>G.</given-names></name><name><surname>Valenzuela</surname><given-names>W.</given-names></name><name><surname>Figueroa</surname><given-names>M.</given-names></name></person-group><article-title>Image recognition in analog VLSI with on-chip learning</article-title><source>Artif. Neural Netw.</source><year>2009</year><volume>5768</volume><fpage>428</fpage><lpage>438</lpage></citation></ref>
<ref id="b18-sensors-12-06244"><label>18.</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Lin</surname><given-names>S.J.</given-names></name><name><surname>Hung</surname><given-names>Y.T.</given-names></name><name><surname>Hwang</surname><given-names>W.J.</given-names></name></person-group><article-title>Efficient hardware architecture based on generalized Hebbian algorithm for texture classification</article-title><source>Neurocomputing</source><year>2011</year><volume>74</volume><fpage>3248</fpage><lpage>3256</lpage><pub-id pub-id-type="doi">10.1016/j.neucom.2011.05.010</pub-id></citation></ref>
<ref id="b19-sensors-12-06244"><label>19.</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Sudha</surname><given-names>N.</given-names></name><name><surname>Mohan</surname><given-names>A.R.</given-names></name><name><surname>Meher</surname><given-names>P.K.</given-names></name></person-group><article-title>A self-configurable systolic architecture for face recognition system based on principal component neural network</article-title><source>IEEE Trans. Circuits Syst. Video Technol.</source><year>2011</year><volume>21</volume><fpage>1071</fpage><lpage>1084</lpage><pub-id pub-id-type="doi">10.1109/TCSVT.2011.2133210</pub-id></citation></ref>
<ref id="b20-sensors-12-06244"><label>20.</label><citation citation-type="confproc"><person-group person-group-type="author"><name><surname>Fowers</surname><given-names>J.</given-names></name><name><surname>Brown</surname><given-names>G.</given-names></name><name><surname>Cooke</surname><given-names>P.</given-names></name><name><surname>Stitt</surname><given-names>G.</given-names></name></person-group><article-title>A Performance and Energy Comparison of FPGAs, GPUs, and Multicores for Sliding-Window Applications</article-title><conf-name>Proceedings of the ACM/SIGDA International Symposium on Field Programmable Gate Arrays</conf-name><conf-loc>Monterey, CA, USA</conf-loc><conf-date>22–24 February 2012</conf-date><fpage>47</fpage><lpage>56</lpage></citation></ref>
<ref id="b21-sensors-12-06244"><label>21.</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Pauwels</surname><given-names>K.</given-names></name><name><surname>Tomasi</surname><given-names>M.</given-names></name><name><surname>Diaz Alonso</surname><given-names>J.</given-names></name><name><surname>Ros</surname><given-names>E.</given-names></name><name><surname>van Hulle</surname><given-names>M.</given-names></name></person-group><article-title>A comparison of FPGA and GPU for real-time phase-based optical flow, stereo, and local image features</article-title><source>IEEE Trans. Comput.</source><year>2012</year><comment>in press</comment></citation></ref>
<ref id="b22-sensors-12-06244"><label>22.</label><citation citation-type="web"><person-group person-group-type="author"><collab>Altera Corporation</collab></person-group><source>NIOS II Processor Reference Handbook Ver 10.0</source><year>2010</year><comment>Available online: <ext-link xlink:href="http://www.altera.com/literature/lit-nio2.jsp" ext-link-type="uri">http://www.altera.com/literature/lit-nio2.jsp</ext-link> (accessed on 3 May 2012)</comment></citation></ref>
<ref id="b23-sensors-12-06244"><label>23.</label><citation citation-type="web"><person-group person-group-type="author"><collab>Altera Corporation</collab></person-group><source>SOPC Builder User Guide</source><year>2011</year><comment>Available online: <ext-link xlink:href="http://www.altera.com/literature/lit-sop.jsp" ext-link-type="uri">http://www.altera.com/literature/lit-sop.jsp</ext-link> (accessed on 3 May 2012)</comment></citation></ref>
<ref id="b24-sensors-12-06244"><label>24.</label><citation citation-type="confproc"><person-group person-group-type="author"><name><surname>Kansal</surname><given-names>A.</given-names></name><name><surname>Zhao</surname><given-names>F.</given-names></name><name><surname>Liu</surname><given-names>J.</given-names></name><name><surname>Kothari</surname><given-names>N.</given-names></name><name><surname>Bhattacharya</surname><given-names>A.</given-names></name></person-group><article-title>Virtual Machine Power Metering and Provisioning</article-title><conf-name>Proceedings of the ACM Symposium on Cloud Computing</conf-name><conf-loc>Indianapolis, IN, USA</conf-loc><conf-date>10–11 June 2010</conf-date></citation></ref>
<ref id="b25-sensors-12-06244"><label>25.</label><citation citation-type="web"><person-group person-group-type="author"><collab>Altera Corporation</collab></person-group><source>Quartus II Handbook Ver 11.1</source><year>2011</year><volume>3</volume><comment>Available online: <ext-link xlink:href="http://www.altera.com/literature/lit-qts.jsp" ext-link-type="uri">http://www.altera.com/literature/lit-qts.jsp</ext-link> (accessed on 3 May 2012)</comment></citation></ref></ref-list>
<sec sec-type="display-objects">
<title>Figures and Tables</title>
<fig id="f1-sensors-12-06244" position="float">
<label>Figure 1.</label>
<caption>
<p>The neural model for the GHA.</p></caption>
<graphic xlink:href="sensors-12-06244f1.gif"/></fig>
<fig id="f2-sensors-12-06244" position="float">
<label>Figure 2.</label>
<caption>
<p>The proposed GHA architecture.</p></caption>
<graphic xlink:href="sensors-12-06244f2.gif"/></fig>
<fig id="f3-sensors-12-06244" position="float">
<label>Figure 3.</label>
<caption>
<p>The hardware implementation of <xref ref-type="disp-formula" rid="FD6">Equations (6)</xref> and <xref ref-type="disp-formula" rid="FD7">(7)</xref>.</p></caption>
<graphic xlink:href="sensors-12-06244f3.gif"/></fig>
<fig id="f4-sensors-12-06244" position="float">
<label>Figure 4.</label>
<caption>
<p>The architecture of each module in the SWU unit.</p></caption>
<graphic xlink:href="sensors-12-06244f4.gif"/></fig>
<fig id="f5-sensors-12-06244" position="float">
<label>Figure 5.</label>
<caption>
<p>The SWU unit operation for computing the first segment of w<sub>1</sub>(<italic>n</italic> + 1).</p></caption>
<graphic xlink:href="sensors-12-06244f5.gif"/></fig>
<fig id="f6-sensors-12-06244" position="float">
<label>Figure 6.</label>
<caption>
<p>The PCC unit architecture.</p></caption>
<graphic xlink:href="sensors-12-06244f6.gif"/></fig>
<fig id="f7-sensors-12-06244" position="float">
<label>Figure 7.</label>
<caption>
<p>The Buffer A architecture in the memory unit.</p></caption>
<graphic xlink:href="sensors-12-06244f7.gif"/></fig>
<fig id="f8-sensors-12-06244" position="float">
<label>Figure 8.</label>
<caption>
<p>The Buffer B architecture in the memory unit.</p></caption>
<graphic xlink:href="sensors-12-06244f8.gif"/></fig>
<fig id="f9-sensors-12-06244" position="float">
<label>Figure 9.</label>
<caption>
<p>The Buffer B operation for the PCC unit.</p></caption>
<graphic xlink:href="sensors-12-06244f9.gif"/></fig>
<fig id="f10-sensors-12-06244" position="float">
<label>Figure 10.</label>
<caption>
<p>The Buffer B operation for the SWU unit.</p></caption>
<graphic xlink:href="sensors-12-06244f10.gif"/></fig>
<fig id="f11-sensors-12-06244" position="float">
<label>Figure 11.</label>
<caption>
<p>The Buffer C architecture.</p></caption>
<graphic xlink:href="sensors-12-06244f11.gif"/></fig>
<fig id="f12-sensors-12-06244" position="float">
<label>Figure 12.</label>
<caption>
<p>The Buffer C operation for the PCC unit.</p></caption>
<graphic xlink:href="sensors-12-06244f12.gif"/></fig>
<fig id="f13-sensors-12-06244" position="float">
<label>Figure 13.</label>
<caption>
<p>The Buffer C operation for the SWU unit.</p></caption>
<graphic xlink:href="sensors-12-06244f13.gif"/></fig>
<fig id="f14-sensors-12-06244" position="float">
<label>Figure 14.</label>
<caption>
<p>The timing diagram for the operations of the proposed architecture: (<bold>a</bold>) <italic>q</italic> &gt; 2<italic>bp</italic> + <italic>s</italic>; (<bold>b</bold>) <italic>q</italic> &lt; 2<italic>bp</italic> + <italic>s</italic>.</p></caption>
<graphic xlink:href="sensors-12-06244f14.gif"/></fig>
<fig id="f15-sensors-12-06244" position="float">
<label>Figure 15.</label>
<caption>
<p>The SOPC system for implementing GHA.</p></caption>
<graphic xlink:href="sensors-12-06244f15.gif"/></fig>
<fig id="f16-sensors-12-06244" position="float">
<label>Figure 16.</label>
<caption>
<p>The interface of the proposed architecture to the SOPC system.</p></caption>
<graphic xlink:href="sensors-12-06244f16.gif"/></fig>
<fig id="f17-sensors-12-06244" position="float">
<label>Figure 17.</label>
<caption>
<p>The operation of the controller of the proposed architecture.</p></caption>
<graphic xlink:href="sensors-12-06244f17.gif"/></fig>
<fig id="f18-sensors-12-06244" position="float">
<label>Figure 18.</label>
<caption>
<p>The CSR distributions of the proposed architecture for the texture set shown in <xref ref-type="fig" rid="f20-sensors-12-06244">Figure 20</xref>.</p></caption>
<graphic xlink:href="sensors-12-06244f18.gif"/></fig>
<fig id="f19-sensors-12-06244" position="float">
<label>Figure 19.</label>
<caption>
<p>The CSR distributions of the proposed architecture for the texture set shown in <xref ref-type="fig" rid="f21-sensors-12-06244">Figure 21</xref>.</p></caption>
<graphic xlink:href="sensors-12-06244f19.gif"/></fig>
<fig id="f20-sensors-12-06244" position="float">
<label>Figure 20.</label>
<caption>
<p>The set of textures for CSR measurements in <xref ref-type="fig" rid="f18-sensors-12-06244">Figure 18</xref>.</p></caption>
<graphic xlink:href="sensors-12-06244f20.gif"/></fig>
<fig id="f21-sensors-12-06244" position="float">
<label>Figure 21.</label>
<caption>
<p>The set of textures for CSR measurements in <xref ref-type="fig" rid="f19-sensors-12-06244">Figure 19</xref>.</p></caption>
<graphic xlink:href="sensors-12-06244f21.gif"/></fig>
<fig id="f22-sensors-12-06244" position="float">
<label>Figure 22.</label>
<caption>
<p>The CSR distribution of GHA with fixed-point and floating-point formats.</p></caption>
<graphic xlink:href="sensors-12-06244f22.gif"/></fig>
<fig id="f23-sensors-12-06244" position="float">
<label>Figure 23.</label>
<caption>
<p>The CPU time of the NIOS-based SOPC system using the proposed architecture as the hardware accelerator for various numbers of training iterations with <italic>m</italic> = 16 × 16 and <italic>p</italic> = 7.</p></caption>
<graphic xlink:href="sensors-12-06244f23.gif"/></fig>
<table-wrap id="t1-sensors-12-06244" position="float">
<label>Table 1.</label>
<caption>
<p>Performance analysis of various architectures for GHA training.</p></caption>
<table frame="hsides" rules="groups">
<thead>
<tr>
<th align="left" valign="top"><bold>Architectures</bold></th>
<th align="center" valign="top"><bold>Adders</bold></th>
<th align="center" valign="top"><bold>Multipliers</bold></th>
<th align="center" valign="top"><bold>Registers</bold></th>
<th align="center" valign="top"><bold>Latency</bold></th></tr></thead>
<tbody>
<tr>
<td align="left" valign="top">Proposed architecture</td>
<td align="center" valign="top"><italic>O</italic>(<italic>q</italic>)</td>
<td align="center" valign="top"><italic>O</italic>(<italic>q</italic>)</td>
<td align="center" valign="top"><italic>O</italic>(<italic>mp</italic>)</td>
<td align="center" valign="top"><italic>max</italic>(<italic>q</italic>, 2<italic>bp</italic> + <italic>s</italic>)</td></tr>
<tr>
<td align="left" valign="top">[<xref ref-type="bibr" rid="b18-sensors-12-06244">18</xref>]</td>
<td align="center" valign="top"><italic>O</italic>(<italic>mp</italic>)</td>
<td align="center" valign="top"><italic>O</italic>(<italic>mp</italic>)</td>
<td align="center" valign="top"><italic>O</italic>(<italic>mp</italic>)</td>
<td align="center" valign="top"><italic>max</italic>(<italic>q</italic> + 1, <italic>p</italic> + 1)</td></tr>
<tr>
<td align="left" valign="top">[<xref ref-type="bibr" rid="b19-sensors-12-06244">19</xref>]</td>
<td align="center" valign="top"><italic>O</italic>(<italic>p</italic>)</td>
<td align="center" valign="top"><italic>O</italic>(<italic>p</italic>)</td>
<td align="center" valign="top"><italic>O</italic>(<italic>mp</italic>)</td>
<td align="center" valign="top">3<italic>m</italic> + <italic>p</italic> − 1</td></tr></tbody></table></table-wrap>
<table-wrap id="t2-sensors-12-06244" position="float">
<label>Table 2.</label>
<caption>
<p>Hardware resource consumption of the proposed GHA architecture for vector dimensions <italic>m</italic> = 16 × 16 and <italic>m</italic> = 32 × 32.</p></caption>
<table frame="hsides" rules="groups">
<thead>
<tr>
<th align="left" valign="middle" rowspan="3"><italic>p</italic></th>
<th colspan="3" align="center" valign="top"><bold>Proposed GHA with <italic>m</italic></bold> = <bold>16</bold> × <bold>16</bold></th>
<th colspan="3" align="center" valign="top"><bold>Proposed GHA with <italic>m</italic></bold> = <bold>32</bold> × <bold>32</bold></th></tr>
<tr>
<th colspan="3" align="center" valign="bottom">
<hr/></th>
<th colspan="3" align="center" valign="bottom">
<hr/></th></tr>
<tr>
<th align="center" valign="middle"><bold>LEs</bold></th>
<th align="center" valign="middle"><bold>Memory Bits</bold></th>
<th align="center" valign="middle"><bold>Embedded Multipliers</bold></th>
<th align="center" valign="middle"><bold>LEs</bold></th>
<th align="center" valign="middle"><bold>Memory Bits</bold></th>
<th align="center" valign="middle"><bold>Embedded Multipliers</bold></th></tr></thead>
<tbody>
<tr>
<td align="left" valign="top">3</td>
<td align="center" valign="top">35,386/149,760</td>
<td align="right" valign="top">0/6,635,520</td>
<td align="center" valign="top">704/720</td>
<td align="right" valign="top">85,271/149,760</td>
<td align="center" valign="top">7,168/6,635,520</td>
<td align="center" valign="top">704/720</td></tr>
<tr>
<td align="left" valign="top">4</td>
<td align="center" valign="top">37,731/149,760</td>
<td align="right" valign="top">0/6,635,520</td>
<td align="center" valign="top">704/720</td>
<td align="right" valign="top">94,244/149,760</td>
<td align="center" valign="top">7,168/6,635,520</td>
<td align="center" valign="top">704/720</td></tr>
<tr>
<td align="left" valign="top">5</td>
<td align="center" valign="top">40,043/149,760</td>
<td align="right" valign="top">7,168/6,635,520</td>
<td align="center" valign="top">704/720</td>
<td align="right" valign="top">103,394/149,760</td>
<td align="center" valign="top">7,168/6,635,520</td>
<td align="center" valign="top">704/720</td></tr>
<tr>
<td align="left" valign="top">6</td>
<td align="center" valign="top">42,404/149,760</td>
<td align="right" valign="top">7,168/6,635,520</td>
<td align="center" valign="top">704/720</td>
<td align="right" valign="top">112,679/149,760</td>
<td align="center" valign="top">7,168/6,635,520</td>
<td align="center" valign="top">704/720</td></tr>
<tr>
<td align="left" valign="top">7</td>
<td align="center" valign="top">44,737/149,760</td>
<td align="right" valign="top">7,168/6,635,520</td>
<td align="center" valign="top">704/720</td>
<td align="right" valign="top">121,940/149,760</td>
<td align="center" valign="top">7,168/6,635,520</td>
<td align="center" valign="top">704/720</td></tr></tbody></table></table-wrap>
<table-wrap id="t3-sensors-12-06244" position="float">
<label>Table 3.</label>
<caption>
<p>Hardware resource consumption of the SOPC system using the proposed GHA architecture as the hardware accelerator for vector dimensions <italic>m</italic> = 16 × 16 and <italic>m</italic> = 32 × 32.</p></caption>
<table frame="hsides" rules="groups">
<thead>
<tr>
<th align="center" valign="middle" rowspan="3"><italic>p</italic></th>
<th colspan="3" align="center" valign="top"><bold>Proposed SOPC with <italic>m</italic></bold> = <bold>16</bold> × <bold>16</bold></th>
<th colspan="3" align="center" valign="top"><bold>Proposed SOPC with <italic>m</italic></bold> = <bold>32</bold> × <bold>32</bold></th></tr>
<tr>
<th colspan="3" align="center" valign="bottom">
<hr/></th>
<th colspan="3" align="center" valign="bottom">
<hr/></th></tr>
<tr>
<th align="center" valign="middle"><bold>LEs</bold></th>
<th align="center" valign="middle"><bold>Memory Bits</bold></th>
<th align="center" valign="middle"><bold>Embedded Multipliers</bold></th>
<th align="center" valign="middle"><bold>LEs</bold></th>
<th align="center" valign="middle"><bold>Memory Bits</bold></th>
<th align="center" valign="middle"><bold>Embedded Multipliers</bold></th></tr></thead>
<tbody>
<tr>
<td align="left" valign="top">3</td>
<td align="center" valign="top">44,377/149,760</td>
<td align="center" valign="top">446,824/6,635,520</td>
<td align="center" valign="top">708/720</td>
<td align="right" valign="top">94,736/149,760</td>
<td align="center" valign="top">453,992/6,635,520</td>
<td align="center" valign="top">708/720</td></tr>
<tr>
<td align="left" valign="top">4</td>
<td align="center" valign="top">46,786/149,760</td>
<td align="center" valign="top">446,824/6,635,520</td>
<td align="center" valign="top">708/720</td>
<td align="right" valign="top">103,968/149,760</td>
<td align="center" valign="top">453,992/6,635,520</td>
<td align="center" valign="top">708/720</td></tr>
<tr>
<td align="left" valign="top">5</td>
<td align="center" valign="top">49,096/149,760</td>
<td align="center" valign="top">453,992/6,635,520</td>
<td align="center" valign="top">708/720</td>
<td align="right" valign="top">113,207/149,760</td>
<td align="center" valign="top">453,992/6,635,520</td>
<td align="center" valign="top">708/720</td></tr>
<tr>
<td align="left" valign="top">6</td>
<td align="center" valign="top">51,449/149,760</td>
<td align="center" valign="top">453,992/6,635,520</td>
<td align="center" valign="top">708/720</td>
<td align="right" valign="top">122,537/149,760</td>
<td align="center" valign="top">453,992/6,635,520</td>
<td align="center" valign="top">708/720</td></tr>
<tr>
<td align="left" valign="top">7</td>
<td align="center" valign="top">54,055/149,760</td>
<td align="center" valign="top">453,992/6,635,520</td>
<td align="center" valign="top">708/720</td>
<td align="right" valign="top">131,779/149,760</td>
<td align="center" valign="top">453,992/6,635,520</td>
<td align="center" valign="top">708/720</td></tr></tbody></table></table-wrap>
<table-wrap id="t4-sensors-12-06244" position="float">
<label>Table 4.</label>
<caption>
<p>Power consumption of various GHA implementations.</p></caption>
<table frame="hsides" rules="groups">
<thead>
<tr>
<th align="left" valign="top"><bold>GHA Implementations</bold></th>
<th align="center" valign="top"><bold>Proposed Architecture</bold></th>
<th align="center" valign="top"><bold>Multithreaded Software (16 threads)</bold></th>
<th align="center" valign="top"><bold>Multithreaded Software (16 threads)</bold></th></tr></thead>
<tbody>
<tr>
<td align="left" valign="top">Multicore Processor</td>
<td align="left" valign="top"/>
<td align="center" valign="top">Intel i7</td>
<td align="center" valign="top">Intel i5</td></tr>
<tr>
<td align="left" valign="top">FPGA Device</td>
<td align="center" valign="top">Altera Cyclone III EP3C120F780C8</td>
<td align="center" valign="top"/>
<td align="center" valign="top"/></tr>
<tr>
<td align="left" valign="top">Clock rate</td>
<td align="center" valign="top">50 MHz</td>
<td align="center" valign="top">2.8 GHz</td>
<td align="center" valign="top">1.6 GHz</td></tr>
<tr>
<td align="left" valign="top">Estimated Power</td>
<td align="center" valign="top">0.129 W</td>
<td align="center" valign="top">31.656 W</td>
<td align="center" valign="top">1.292 W</td></tr></tbody></table></table-wrap>
<table-wrap id="t5-sensors-12-06244" position="float">
<label>Table 5.</label>
<caption>
<p>Computation time of various GHA architectures.</p></caption>
<table frame="hsides" rules="groups">
<thead>
<tr>
<th align="left" valign="top"><bold>Architectures</bold></th>
<th align="center" valign="top"><bold>Proposed Architecture</bold></th>
<th align="center" valign="top">[<xref ref-type="bibr" rid="b18-sensors-12-06244">18</xref>]</th>
<th align="center" valign="top">[<xref ref-type="bibr" rid="b19-sensors-12-06244">19</xref>]</th></tr></thead>
<tbody>
<tr>
<td align="left" valign="top">Vector Dimension <italic>m</italic></td>
<td align="center" valign="top">16 × 16</td>
<td align="center" valign="top">4 × 4</td>
<td align="center" valign="top">16 × 8</td></tr>
<tr>
<td align="left" valign="top"># of Principal Components <italic>p</italic></td>
<td align="center" valign="top">16</td>
<td align="center" valign="top">4</td>
<td align="center" valign="top">16</td></tr>
<tr>
<td align="left" valign="top">FPGA Device</td>
<td align="center" valign="top">Altera Cyclone III EP3C120F780C8</td>
<td align="center" valign="top">Altera Cyclone III EP3C120F780C8</td>
<td align="center" valign="top">Xilinx Virtex 4 XC4VFX12</td></tr>
<tr>
<td align="left" valign="top">Clock Rate</td>
<td align="center" valign="top">100 MHz</td>
<td align="center" valign="top">75 MHz</td>
<td align="center" valign="top">136.243 MHz</td></tr>
<tr>
<td align="left" valign="top">Number of Iterations</td>
<td align="center" valign="top">100</td>
<td align="center" valign="top">100</td>
<td align="center" valign="top">100</td></tr>
<tr>
<td align="left" valign="top"># of Training Vectors per Iteration</td>
<td align="center" valign="top">888 × 8</td>
<td align="center" valign="top">888 × 8</td>
<td align="center" valign="top">888 × 8</td></tr>
<tr>
<td align="left" valign="top">Computation Time</td>
<td align="center" valign="top">1.369 s</td>
<td align="center" valign="top">86.58 ms</td>
<td align="center" valign="top">2.09 s</td></tr></tbody></table></table-wrap></sec></back></article>
