<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE article PUBLIC "-//NLM//DTD Journal Publishing DTD v2.3 20070202//EN" "journalpublishing.dtd">
<article xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink" xml:lang="en" article-type="research-article">
<front>
<journal-meta>
<journal-id journal-id-type="nlm-ta">Sensors</journal-id>
<journal-title>Sensors</journal-title>
<issn pub-type="epub">1424-8220</issn>
<publisher>
<publisher-name>Molecular Diversity Preservation International (MDPI)</publisher-name></publisher></journal-meta>
<article-meta>
<article-id pub-id-type="doi">10.3390/s110807908</article-id>
<article-id pub-id-type="publisher-id">sensors-11-07908</article-id>
<article-categories>
<subj-group>
<subject>Article</subject></subj-group></article-categories>
<title-group>
<article-title>Network Coding on Heterogeneous Multi-Core Processors for Wireless Sensor Networks</article-title></title-group>
<contrib-group>
<contrib contrib-type="author">
<name><surname>Kim</surname><given-names>Deokho</given-names></name><xref ref-type="aff" rid="af1-sensors-11-07908"><sup>1</sup></xref></contrib>
<contrib contrib-type="author">
<name><surname>Park</surname><given-names>Karam</given-names></name><xref ref-type="aff" rid="af2-sensors-11-07908"><sup>2</sup></xref></contrib>
<contrib contrib-type="author">
<name><surname>Ro</surname><given-names>Won W.</given-names></name><xref ref-type="aff" rid="af1-sensors-11-07908"><sup>1</sup></xref><xref ref-type="corresp" rid="c1-sensors-11-07908"><sup>★</sup></xref></contrib></contrib-group>
<aff id="af1-sensors-11-07908">
<label>1</label> The School of Electrical and Electronic Engineering, Yonsei University, Seoul 120-749, Korea; E-Mail: <email>nautes87@yonsei.ac.kr</email></aff>
<aff id="af2-sensors-11-07908">
<label>2</label> Mobile Communications, Samsung Electronics, Suwon 443-373, Korea; E-Mail: <email>karam.park@samsung.com</email></aff>
<author-notes>
<corresp id="c1-sensors-11-07908">
<label>★</label>Author to whom correspondence should be addressed; E-Mail: <email>wro@yonsei.ac.kr</email>.</corresp></author-notes>
<pub-date pub-type="collection">
<year>2011</year></pub-date>
<pub-date pub-type="epub">
<day>11</day>
<month>8</month>
<year>2011</year></pub-date>
<volume>11</volume>
<issue>8</issue>
<fpage>7908</fpage>
<lpage>7933</lpage>
<history>
<date date-type="received">
<day>24</day>
<month>5</month>
<year>2011</year></date>
<date date-type="rev-recd">
<day>3</day>
<month>8</month>
<year>2011</year></date>
<date date-type="accepted">
<day>10</day>
<month>8</month>
<year>2011</year></date></history>
<permissions>
<copyright-statement>© 2011 by the authors; licensee MDPI, Basel, Switzerland.</copyright-statement>
<copyright-year>2011</copyright-year>
<license>
<p>This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution license (http://creativecommons.org/licenses/by/3.0/).</p></license></permissions>
<abstract>
<p>While network coding is well known for its efficiency and usefulness in wireless sensor networks, the excessive costs associated with decoding computation and complexity still hinder its adoption in practice. Meanwhile, high-performance microprocessors with heterogeneous multi-cores are expected to serve as processing nodes of wireless sensor networks in the near future. To this end, this paper introduces an efficient network coding algorithm developed for heterogeneous multi-core processors. The proposed idea is fully tested on one of the currently available heterogeneous multi-core processors, the Cell Broadband Engine.</p></abstract>
<kwd-group>
<kwd>network coding</kwd>
<kwd>sensor nodes</kwd>
<kwd>parallel algorithms</kwd>
<kwd>heterogeneous multi-core processors</kwd></kwd-group></article-meta></front>
<body>
<sec sec-type="intro">
<label>1.</label>
<title>Introduction</title>
<p>Network coding is a new coding technique first proposed by Ahlswede <italic>et al</italic>. to enhance network throughput and effectiveness on multi-nodal environments [<xref ref-type="bibr" rid="b1-sensors-11-07908">1</xref>] such as wireless sensor networks (WSN). A new paradigm has emerged for computer network systems enabled by network coding; advances in network coding techniques have influenced information and coding theory, computer network performance, and wired/wireless communication systems. In addition, network coding lends itself particularly well to multicasting, enhancing the effectiveness of multicasting compared to traditional coding approaches.</p>
<p>In fact, the use of network coding techniques in various real-world applications has already been introduced [<xref ref-type="bibr" rid="b2-sensors-11-07908">2</xref>,<xref ref-type="bibr" rid="b3-sensors-11-07908">3</xref>]. Furthermore, network coding has the potential to deliver a number of benefits in various domains such as wireless networks, sensor networks, network security, peer-to-peer (P2P) systems, and on-demand video streaming services [<xref ref-type="bibr" rid="b2-sensors-11-07908">2</xref>,<xref ref-type="bibr" rid="b4-sensors-11-07908">4</xref>–<xref ref-type="bibr" rid="b16-sensors-11-07908">16</xref>]. In wireless network systems, network coding can increase transmission efficiency at routers by forwarding the coded packets as a passive acknowledgement [<xref ref-type="bibr" rid="b11-sensors-11-07908">11</xref>], and it can increase the performance of ad-hoc networks as well as save energy in many-to-many broadcast environments [<xref ref-type="bibr" rid="b12-sensors-11-07908">12</xref>,<xref ref-type="bibr" rid="b13-sensors-11-07908">13</xref>]. In addition, coded packets in transit cannot be decoded until a sufficient number of packets has been collected; thus network coding can also simplify the implementation of secure networks [<xref ref-type="bibr" rid="b14-sensors-11-07908">14</xref>,<xref ref-type="bibr" rid="b15-sensors-11-07908">15</xref>]. On the practical side, Liu <italic>et al</italic>. analyze the performance of network coding on real-world commercial systems using 200 GBytes of real-world traces collected during the Summer Olympic Games in 2008 [<xref ref-type="bibr" rid="b16-sensors-11-07908">16</xref>].</p>
<p>While network coding has several advantages and is a promising technique for the future of network systems, one crucial drawback is its computational overhead, which may hinder its adoption in practice. Network coding requires encoding the data before it is sent and decoding it after it is received. However, the decoding algorithm, a variant of Gaussian elimination, has <italic>O</italic>(<italic>n</italic><sup>3</sup>) computational complexity, where <italic>n</italic> is the size of a coefficient vector. The computation overhead associated with the decoding operation is very costly, especially in environments with low computing power such as wireless sensor networks. As a result, the benefits of the network coding technique may be canceled out by the long decoding delay.</p>
<p>Meanwhile, multi-core processors have recently become widespread and can be found in a variety of systems [<xref ref-type="bibr" rid="b17-sensors-11-07908">17</xref>], from high-performance servers to special-purpose wireless sensor networks [<xref ref-type="bibr" rid="b18-sensors-11-07908">18</xref>,<xref ref-type="bibr" rid="b19-sensors-11-07908">19</xref>]. Current research on sensor networks mainly uses lightweight processing nodes as sensor nodes. However, we expect that future WSN systems will require more computing power, especially for multimedia sensors, which makes the multi-core processor a possible choice for the sensor node. Indeed, a prototype multi-core platform for sensor nodes has been introduced in previous research [<xref ref-type="bibr" rid="b18-sensors-11-07908">18</xref>]. This paper is based on the expectation that future WSNs will widely use multi-core processors and require parallelized random linear network coding. In addition, the use of advanced microprocessor features such as multimedia extensions in WSNs has been investigated in previous literature as well [<xref ref-type="bibr" rid="b20-sensors-11-07908">20</xref>]. Processor development has also resulted in a progressively increasing number of cores on a single chip. There are two multi-core design paradigms: one integrates multiple homogeneous cores on a single chip, whereas the other incorporates heterogeneous cores.</p>
<p>In this paper, we present a parallel network coding algorithm for heterogeneous multi-core processors, with the specific aim of applying the technique in WSNs. We select an already available heterogeneous platform, the Cell BE, as a prototype of heterogeneous multi-core processors and adjust the workload distribution on each core for efficient network coding. The Cell BE is a heterogeneous multi-core processor designed to provide both generality and intensive computing power with the single instruction multiple data (SIMD) paradigm. Therefore, the design of the Cell BE lends itself well to the adoption of SIMD, which can be efficiently utilized in wireless multimedia sensor networks [<xref ref-type="bibr" rid="b20-sensors-11-07908">20</xref>]. A GPU could also be chosen as a high-performance computing device for wireless sensor networks. However, we are concerned that using a GPU requires the support of an additional general-purpose processor, which might introduce additional hardware and software overhead.</p>
<p>Admittedly, using the Cell BE processor in sensor nodes may not be desirable due to its size and power consumption. However, the main goal of this paper is to present efficient parallel algorithms for network coding on heterogeneous processors and to demonstrate their possible advantages and feasibility. To achieve this, we formulate an appropriate load balancing method based on the concept of divisible load theory (DLT), which was initially introduced by Bharadwaj <italic>et al</italic>. and Drozdowski in the context of distributed and cluster systems [<xref ref-type="bibr" rid="b21-sensors-11-07908">21</xref>–<xref ref-type="bibr" rid="b23-sensors-11-07908">23</xref>]. In addition, we consider three different approaches incorporating parallelized decoding across the multiple heterogeneous cores, employing Galois field computation methods.</p>
<p>Via real machine experiments, we demonstrate that the proposed technique delivers improvements in decoding speed. With proper load balancing, we achieve a maximum speed-up of 2.15 compared to the performance results without load balancing. In addition, we compare our approach with results obtained on two homogeneous multi-core processors which provide competitive computing power. Compared to the Intel quad-core system, our approach achieves a maximum speed-up of 2.19, with 1 MB of data and a coefficient matrix of size 64 × 64. When we compare our performance to that of an AMD processor, we observe a maximum speed-up of 3.12 for 128 KB of data and a coefficient matrix of size 64 × 64.</p>
<p>The rest of this paper is organized as follows. We describe network coding theory and give a brief overview of the Cell BE architecture in Section 2. Then, in Section 3, we propose parallelized network coding implementations for use on the Cell BE, as well as an extension to the SIMD instruction set. In Section 4, experimental performance results are presented and analyzed. In Section 5, related work is discussed. Finally, we conclude the paper in Section 6.</p></sec>
<sec>
<label>2.</label>
<title>Background</title>
<p>In this section, we first give an overview of the Cell BE and then present the necessary background on network coding.</p>
<sec>
<label>2.1.</label>
<title>Overview of the Cell BE</title>
<p>The Cell Broadband Engine (Cell BE) is a heterogeneous multiprocessor that was developed by Sony, IBM, and Toshiba in 2000. Although a long time has passed since the first release of the Cell BE, it delivers 256 GFLOPS (giga floating-point operations per second) and, as a single-chip processor, still compares well with today’s high-performance commercial processors such as the Intel Core i7 series (the Intel Core i7 975 has a theoretical performance of 221.44 GFLOPS) [<xref ref-type="bibr" rid="b24-sensors-11-07908">24</xref>]. In addition, the Cell BE is well suited to demonstrating heterogeneous programming models. The Cell BE consists of one 64-bit Power Architecture processor, referred to as a <italic>Power Processor Element</italic> (PPE), and eight co-processors, referred to as <italic>Synergistic Processor Elements</italic> (SPEs). The Cell BE also includes a Direct Memory Access (DMA) controller and a high-bandwidth data bus, referred to as the Element Interconnect Bus (EIB). These components are presented in <xref ref-type="fig" rid="f1-sensors-11-07908">Figure 1</xref>. The Cell BE processor incorporates a Single Instruction Multiple Data (SIMD) execution unit, high power and area efficiency, large memory bandwidth, a large-bandwidth on-chip coherent bus, and high-bandwidth flexible I/O [<xref ref-type="bibr" rid="b25-sensors-11-07908">25</xref>].</p>
<p>The PPE is a dual-threaded, dual-issue, in-order 64-bit Power Architecture processor. It has a 32 KB instruction cache and a 32 KB data cache, as well as a 512 KB L2 cache. In addition, the PPE has an <italic>AltiVec</italic> vector extension unit with floating-point and integer SIMD instruction sets.</p>
<p>Each SPE is composed of a Synergistic Processor Unit (SPU), a 256 KB local store, and a Memory Flow Controller (MFC). The execution performance of the SPEs largely determines the overall computational performance of the Cell BE. The SPU contains a 128-bit-wide dual-issue SIMD unit that is fully pipelined for all precisions, with the exception of the double-precision vector unit. The SPE can access main storage through effective address (EA) translation by the MFC and asynchronously transfer data to local storage, which has both narrow (128 bits) and wide (128 bytes) features.</p>
<p>The Element Interconnect Bus (EIB) is a coherent bus that can transfer up to 96 bytes per clock cycle. It consists of four 16-byte rings, each of which is capable only of unidirectional data transfer, clockwise or counter-clockwise, and each of which supports up to three simultaneous data transfers. The Cell BE employs dual-channel <italic>Rambus</italic> XDR DRAM, which is capable of transferring 12.8 GB/s per 32-bit memory channel, for a total bandwidth of 25.6 GB/s [<xref ref-type="bibr" rid="b26-sensors-11-07908">26</xref>].</p></sec>
<sec>
<label>2.2.</label>
<title>Benefit of Using Network Coding</title>
<p>In this subsection, we introduce the principles and advantages of network coding. <xref ref-type="fig" rid="f2-sensors-11-07908">Figure 2</xref> presents a simple example of a communication network, which is represented as a directed graph [<xref ref-type="bibr" rid="b1-sensors-11-07908">1</xref>].</p>
<p>Each directed edge represents a pathway for information transfer. Node <italic>S</italic> represents the source and the nodes <italic>D</italic> and <italic>E</italic> are destinations. The other nodes are intermediaries, routing information to the destination nodes. If we assume that each link is limited to a bandwidth of one bit per unit time, a traditional routing protocol cannot attain higher throughput than this limit. However, using network coding, we can achieve throughput in excess of this limit.</p>
<p>Let us assume that we have generated data bits <italic>a</italic> and <italic>b</italic> from source node <italic>S</italic>, and that we wish to route the data to destination nodes <italic>D</italic> and <italic>E</italic>. Data bit <italic>a</italic> is transported via paths <italic>S-A-C</italic> and <italic>S-A-D</italic>, and data bit <italic>b</italic> via paths <italic>S-B-C</italic> and <italic>S-B-E</italic>.</p>
<p>At the edge spanning nodes <italic>C</italic> and <italic>Z</italic>, constrained by our bandwidth limitation, we can only transport one of either <italic>a</italic> or <italic>b</italic> per unit time. Suppose that we send <italic>a</italic> along the edge between nodes <italic>C</italic> and <italic>Z</italic>. In this case, node <italic>D</italic> could not receive <italic>b</italic> and would only be capable of receiving <italic>a</italic> twice, from <italic>A</italic> and <italic>Z</italic>. Likewise, if we send <italic>b</italic> instead, node <italic>E</italic> would face the same problem. As it is not possible to transfer data bits <italic>a</italic> and <italic>b</italic> to both nodes <italic>D</italic> and <italic>E</italic> simultaneously, routing alone is inadequate.</p>
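<p>As a minimal sketch of the bottleneck described above, and of the <italic>xor</italic> remedy that the next paragraph describes, the following Python fragment models one time unit on the network of <xref ref-type="fig" rid="f2-sensors-11-07908">Figure 2</xref>. The function name <monospace>butterfly</monospace> and the mode strings are hypothetical and serve illustration only; the topology follows the text.</p>
<preformat><![CDATA[
```python
def butterfly(a, b, mode):
    """Return the bits that D and E can recover after one time unit.

    mode is "route_a" or "route_b" (C forwards one bit unchanged on the
    C-Z link) or "xor" (C forwards a xor b). Names are illustrative only.
    """
    if mode == "route_a":
        z = a
    elif mode == "route_b":
        z = b
    elif mode == "xor":
        z = a ^ b
    else:
        raise ValueError(mode)

    # D hears a directly from A and z from Z; E hears b from B and z from Z.
    if mode == "xor":
        d = {"a": a, "b": a ^ z}   # a xor (a xor b) = b
        e = {"a": b ^ z, "b": b}   # b xor (a xor b) = a
    else:
        d = {"a": a}
        e = {"b": b}
        if mode == "route_b":
            d["b"] = z             # D learns b, but E receives b twice
        if mode == "route_a":
            e["a"] = z             # E learns a, but D receives a twice
    return d, e
```
]]></preformat>
<p>Under plain routing one of the two destinations always misses a bit, whereas in the <monospace>"xor"</monospace> mode both destinations recover <italic>a</italic> and <italic>b</italic> in the same time unit.</p>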
<p>When using network coding, we are able to generate new data by first encoding <italic>a</italic> and <italic>b</italic>, and then routing the encoded data through the directed linkage between nodes <italic>C</italic> and <italic>Z</italic>. As a simple example, we use a bitwise <italic>xor</italic> to encode data bits <italic>a</italic> and <italic>b</italic>. The new data is thus encoded as ‘<italic>a xor b</italic>’ and is sent along paths <italic>C-Z-D</italic> and <italic>C-Z-E</italic>, simultaneously. Node <italic>D</italic> would therefore receive data bits <italic>a</italic> and (<italic>a xor b</italic>) from nodes <italic>A</italic> and <italic>Z</italic>, respectively. Further, node <italic>E</italic> would receive both data <italic>b</italic> and (<italic>a xor b</italic>) from nodes <italic>B</italic> and <italic>Z</italic>. Therefore, both nodes <italic>D</italic> and <italic>E</italic> can collect data bits <italic>a</italic> and <italic>b</italic> using the <italic>xor</italic> operation. In conclusion, using a network coding technique allows us to achieve an enhanced multicast throughput of two bits to both nodes, subject to the same base network capacity of <italic>one bit per unit time</italic>.</p></sec>
<sec>
<label>2.3.</label>
<title>Random Linear Network Coding</title>
<p>To fully leverage the potential benefits of the network coding technique in a practical system, the encoding and decoding operations must be fast enough (<italic>i.e.</italic>, they must not act as bottlenecks to the transmission process). The execution time of the network coding is primarily dependent upon the coding method used. We employ the random linear coding [<xref ref-type="bibr" rid="b27-sensors-11-07908">27</xref>] in our Cell BE implementations, as it is widely used and known to be asymptotically optimal in any network format.</p>
<p>A given segment of data, such as a single file, is divided into a specific number of blocks, referred to as <italic>packets</italic>, prior to being transferred over a network, as shown in <xref ref-type="fig" rid="f3-sensors-11-07908">Figure 3</xref>. In this figure, <bold>p</bold><italic><sub>k</sub></italic> represents the <italic>k</italic><italic><sup>th</sup></italic> block and <bold>c</bold><italic><sub>i</sub></italic> is a coded block, which is a linear combination of the original blocks. In other words, 
<inline-formula>
<mml:math>
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="bold">c</mml:mi></mml:mrow>
<mml:mi>i</mml:mi></mml:msub>
<mml:mo>=</mml:mo>
<mml:msubsup>
<mml:mrow>
<mml:mi mathvariant="normal">Σ</mml:mi></mml:mrow>
<mml:mrow>
<mml:mi>k</mml:mi>
<mml:mo>=</mml:mo>
<mml:mn>1</mml:mn></mml:mrow>
<mml:mi>n</mml:mi></mml:msubsup>
<mml:msub>
<mml:mrow>
<mml:mi>e</mml:mi></mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi>k</mml:mi></mml:mrow></mml:msub>
<mml:msub>
<mml:mrow>
<mml:mstyle fontweight="bold" fontstyle="normal">
<mml:mi>p</mml:mi></mml:mstyle></mml:mrow>
<mml:mi>k</mml:mi></mml:msub></mml:mrow></mml:math></inline-formula> where <italic>n</italic> is the number of blocks and the elements of the coefficient vector <bold>e</bold><italic><sub>i</sub></italic> are selected at random from a finite field, <italic>F</italic>. The coded data <bold>c</bold><italic><sub>i</sub></italic> is combined with its coefficient vector [<italic>e<sub>i</sub></italic><sub>,1</sub>, ..., <italic>e<sub>i</sub></italic><sub>,</sub><italic><sub>n</sub></italic>], which is stored in the header, and the result is broadcast to the destination. A transfer unit, comprising the coded data and the coefficient vector, is presented in <xref ref-type="fig" rid="f4-sensors-11-07908">Figure 4</xref>.</p>
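<p>The encoding step above can be sketched as follows. This is an illustration under stated assumptions, not the implementation evaluated in this paper: the field <italic>F</italic> is assumed to be GF(2<sup>8</sup>) with the reduction polynomial 0x11B, each symbol is one byte, and the helper names <monospace>gf_mul</monospace> and <monospace>encode</monospace> are hypothetical.</p>
<preformat><![CDATA[
```python
import random


def gf_mul(x, y, poly=0x11B):
    """Multiply two bytes in GF(2^8) by shift-and-add, reducing by `poly`.

    The polynomial 0x11B is an assumption; the text only says "finite field".
    """
    result = 0
    while y:
        if y & 1:
            result ^= x
        x <<= 1
        if x & 0x100:
            x ^= poly
        y >>= 1
    return result


def encode(blocks, rng):
    """Build one transfer unit (coefficient vector, coded block).

    blocks is a list of n equal-length byte sequences p_1..p_n.
    """
    n = len(blocks)
    m = len(blocks[0])
    coeffs = [rng.randrange(256) for _ in range(n)]   # e_{i,1} .. e_{i,n}
    coded = [0] * m
    for e_ik, p_k in zip(coeffs, blocks):
        for j in range(m):
            coded[j] ^= gf_mul(e_ik, p_k[j])          # addition in GF(2^8) is XOR
    return coeffs, bytes(coded)
```
]]></preformat>
<p>A call such as <monospace>encode([b"ab", b"cd"], random.Random(0))</monospace> returns the header (the coefficient vector) together with the coded payload, mirroring the transfer unit of <xref ref-type="fig" rid="f4-sensors-11-07908">Figure 4</xref>.</p>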
<p>While the packets are being routed, the packets are re-encoded within nodes along the pathways to their destinations before being passed to downstream nodes. When a packet arrives at its destination node, it is stored in local memory so the coded data can be decoded and recovered to the original data set [<italic>p</italic><sub>1</sub>, ..., <italic>p<sub>n</sub></italic>]<sup>T</sup>. To decode encoded data, the destination node must have all <italic>n</italic> transfer units, with linearly independent coefficient vectors. Suppose a destination node has collected <italic>n</italic> transfer units and that the coefficient vector, original data, and coded data set are represented by 
<inline-formula>
<mml:math>
<mml:mrow>
<mml:msup>
<mml:mrow>
<mml:mi>E</mml:mi></mml:mrow>
<mml:mi>T</mml:mi></mml:msup>
<mml:mo>=</mml:mo>
<mml:mo stretchy="false">[</mml:mo>
<mml:msubsup>
<mml:mrow>
<mml:mstyle fontweight="bold" fontstyle="normal">
<mml:mi>e</mml:mi></mml:mstyle></mml:mrow>
<mml:mn>1</mml:mn>
<mml:mi>T</mml:mi></mml:msubsup>
<mml:mo>,</mml:mo>
<mml:mo>…</mml:mo>
<mml:mo>,</mml:mo>
<mml:msubsup>
<mml:mrow>
<mml:mstyle fontweight="bold" fontstyle="normal">
<mml:mi>e</mml:mi></mml:mstyle></mml:mrow>
<mml:mi>n</mml:mi>
<mml:mi>T</mml:mi></mml:msubsup>
<mml:mo stretchy="false">]</mml:mo></mml:mrow></mml:math></inline-formula>, 
<inline-formula>
<mml:math>
<mml:mrow>
<mml:msup>
<mml:mrow>
<mml:mi>C</mml:mi></mml:mrow>
<mml:mi>T</mml:mi></mml:msup>
<mml:mo>=</mml:mo>
<mml:mo stretchy="false">[</mml:mo>
<mml:msubsup>
<mml:mrow>
<mml:mstyle fontweight="bold" fontstyle="normal">
<mml:mi>c</mml:mi></mml:mstyle></mml:mrow>
<mml:mn>1</mml:mn>
<mml:mi>T</mml:mi></mml:msubsup>
<mml:mo>,</mml:mo>
<mml:mo>…</mml:mo>
<mml:mo>,</mml:mo>
<mml:msubsup>
<mml:mrow>
<mml:mstyle fontweight="bold" fontstyle="normal">
<mml:mi>c</mml:mi></mml:mstyle></mml:mrow>
<mml:mi>n</mml:mi>
<mml:mi>T</mml:mi></mml:msubsup>
<mml:mo stretchy="false">]</mml:mo></mml:mrow></mml:math></inline-formula>, 
<inline-formula>
<mml:math>
<mml:mrow>
<mml:msup>
<mml:mrow>
<mml:mi>P</mml:mi></mml:mrow>
<mml:mi>T</mml:mi></mml:msup>
<mml:mo>=</mml:mo>
<mml:mo stretchy="false">[</mml:mo>
<mml:msubsup>
<mml:mrow>
<mml:mstyle fontweight="bold" fontstyle="normal">
<mml:mi>p</mml:mi></mml:mstyle></mml:mrow>
<mml:mn>1</mml:mn>
<mml:mi>T</mml:mi></mml:msubsup>
<mml:mo>,</mml:mo>
<mml:mo>…</mml:mo>
<mml:mo>,</mml:mo>
<mml:msubsup>
<mml:mrow>
<mml:mstyle fontweight="bold" fontstyle="normal">
<mml:mi>p</mml:mi></mml:mstyle></mml:mrow>
<mml:mi>n</mml:mi>
<mml:mi>T</mml:mi></mml:msubsup>
<mml:mo stretchy="false">]</mml:mo></mml:mrow></mml:math></inline-formula>, respectively, where superscript <italic>T</italic> denotes the matrix transpose. Since encoding computes the matrix product <italic>C</italic> = <italic>EP</italic>, we can rearrange this to obtain <italic>P</italic> = <italic>E</italic><sup>−1</sup><italic>C</italic>, allowing us to recover the original data by multiplying the inverse of <italic>E</italic> with <italic>C</italic>. To perform the decoding operation, the coefficient matrix <italic>E</italic> must be invertible; thus all coefficient vectors <bold>e</bold><italic><sub>i</sub></italic> must be linearly independent.</p>
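<p>To make the relation <italic>P</italic> = <italic>E</italic><sup>−1</sup><italic>C</italic> concrete, the following sketch solves <italic>C</italic> = <italic>EP</italic> by Gauss-Jordan elimination on the augmented matrix [<italic>E</italic> | <italic>C</italic>] once all <italic>n</italic> transfer units are available. It assumes GF(2<sup>8</sup>) with the reduction polynomial 0x11B and one-byte symbols; the names <monospace>gf_mul</monospace>, <monospace>gf_inv</monospace>, and <monospace>decode</monospace> are hypothetical and not taken from the implementation evaluated here.</p>
<preformat><![CDATA[
```python
def gf_mul(x, y, poly=0x11B):
    """Multiply two bytes in GF(2^8); the polynomial 0x11B is an assumption."""
    result = 0
    while y:
        if y & 1:
            result ^= x
        x <<= 1
        if x & 0x100:
            x ^= poly
        y >>= 1
    return result


def gf_inv(x):
    """Multiplicative inverse as x^254, since x^255 == 1 for every x != 0."""
    if x == 0:
        raise ZeroDivisionError("0 has no inverse in GF(2^8)")
    result = 1
    for _ in range(254):
        result = gf_mul(result, x)
    return result


def decode(E, C):
    """Solve C = E P for P by Gauss-Jordan elimination on [E | C]."""
    n = len(E)
    rows = [list(e) + list(c) for e, c in zip(E, C)]
    for col in range(n):
        # Find a row at or below the diagonal with a non-zero entry in `col`.
        pivot = next((r for r in range(col, n) if rows[r][col]), None)
        if pivot is None:
            raise ValueError("coefficient vectors are linearly dependent")
        rows[col], rows[pivot] = rows[pivot], rows[col]
        # Scale the pivot row so that its leading entry becomes 1.
        inv = gf_inv(rows[col][col])
        rows[col] = [gf_mul(inv, v) for v in rows[col]]
        # Clear column `col` in every other row (subtraction is XOR).
        for r in range(n):
            factor = rows[r][col]
            if r != col and factor:
                rows[r] = [v ^ gf_mul(factor, w)
                           for v, w in zip(rows[r], rows[col])]
    return [bytes(row[n:]) for row in rows]
```
]]></preformat>
<p>When the coefficient vectors are linearly dependent, no pivot can be found for some column and the sketch raises <monospace>ValueError</monospace>, which corresponds to the invertibility requirement on <italic>E</italic>.</p>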
<p>Using a variant of <italic>Gaussian elimination</italic>, we can obtain matrix <italic>P</italic>. When the destination receives transfer units, it constructs coefficient and coded data matrices, as shown in <xref ref-type="fig" rid="f4-sensors-11-07908">Figure 4</xref>, to prepare for the process of Gaussian elimination. Typical Gaussian elimination or LU decomposition for the purpose of decoding at the destination requires that all <italic>n</italic> transfer units first be collected before the process starts. However, we can use progressive decoding instead of multiplying by the inverse matrix. With progressive decoding [<xref ref-type="bibr" rid="b28-sensors-11-07908">28</xref>], we do not need to wait for all transfer units to be received: decoding can be initiated before all units have arrived and continues as each unit becomes available. In addition, progressive decoding can proceed regardless of the arrival order of the coded packets, because changing the row order of a linearly independent coefficient matrix does not affect the decoding result.</p>
<p>Let <italic>n</italic> represent the number of blocks and <italic>m</italic> represent the block size. The computational complexity of standard Gaussian elimination is <italic>O</italic>(<italic>n</italic><sup>3</sup>). However, in the decoding process associated with network coding, each row is extended by a coded block of size <italic>m</italic>, represented by <bold>c</bold><italic><sub>i</sub></italic> in <xref ref-type="fig" rid="f4-sensors-11-07908">Figure 4</xref>. Therefore, the computational complexity in network coding increases to <italic>O</italic>((<italic>n</italic> + <italic>m</italic>) × <italic>n</italic><sup>2</sup>).</p>
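<p>A rough operation count illustrates where the <italic>O</italic>((<italic>n</italic> + <italic>m</italic>) × <italic>n</italic><sup>2</sup>) bound comes from. The sketch below assumes a plain Gauss-Jordan elimination in which each of the <italic>n</italic> pivot rows updates the other <italic>n</italic> − 1 rows across all <italic>n</italic> + <italic>m</italic> columns, without skipping entries known to be zero; the function name <monospace>elimination_ops</monospace> is hypothetical.</p>
<preformat><![CDATA[
```python
def elimination_ops(n, m):
    """Count field multiply-add operations for Gauss-Jordan elimination
    on an n x (n + m) augmented matrix [E | C].

    Assumption: every pivot row updates the other n - 1 rows over all
    n + m columns, so the count is n * (n - 1) * (n + m).
    """
    return n * (n - 1) * (n + m)
```
]]></preformat>
<p>For instance, with <italic>n</italic> = 64 and one-byte symbols, a 1 MB segment gives <italic>m</italic> = 16384, so the factor <italic>n</italic> + <italic>m</italic> is dominated by the coded data rather than by the coefficient matrix.</p>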
<p>An additional peer within a file swarming system can reduce download delay by <italic>n</italic>, receiving at most <italic>n</italic> blocks simultaneously. However, the resultant decoding delay, which increases in proportion to <italic>n</italic><sup>3</sup>, offsets the reduction in download delay, and thus the benefit is canceled out. Therefore, in order to benefit from a large <italic>n</italic>, an efficient, fast decoding implementation is required. That is, we can achieve greater performance gains with larger <italic>n</italic> if we are able to overcome the computational delay.</p></sec>
<sec>
<label>2.4.</label>
<title>Overview of Progressive Network Coding</title>
<p>A variety of decoding methods that employ the random linear network coding technique are based on matrix inversion algorithms [<xref ref-type="bibr" rid="b29-sensors-11-07908">29</xref>,<xref ref-type="bibr" rid="b30-sensors-11-07908">30</xref>]. Though using the traditional algorithms is a proven method of parallel decoding in network environments, there is an additional cost incurred from network transmission delay. This delay is particularly problematic because the system must wait until all packets are received before it can compose the matrices used in these traditional decoding algorithms. As such, we can obtain greater performance using progressive decoding in packet switching network environments, which are subject to these transmission delays.</p>
<p>The traditional matrix inversion algorithms require a complete matrix to perform the decoding operation; this results in additional delays due to the waiting period. In contrast, progressive decoding requires only one row of the matrix to proceed with decoding. As such, progressive decoding is more suitable to network environments that are subject to long transmission delays.</p>
<p>The decoding process for traditional matrix inversion algorithms has a computational complexity of <italic>O</italic>(<italic>n</italic><sup>3</sup>), incurred after the last row has arrived. With progressive decoding, however, we can process each row as it is received. Since the computation for all prior rows has already finished, the most recent row can be processed with a complexity of <italic>O</italic>(<italic>n</italic><sup>2</sup>). In our evaluation, we employ progressive decoding to implement parallel decoding algorithms on the Cell BE.</p>
<p>Shojania and Li were the first to demonstrate the effectiveness of parallelization in network coding with their <italic>Progressive Parallelized Network Coding</italic> algorithms [<xref ref-type="bibr" rid="b28-sensors-11-07908">28</xref>]. However, our previous research identified inefficiencies and load imbalance in their work, particularly with respect to large coefficient matrix sizes, and proposed the Dynamic Vertical Partitioning (DVP) algorithm [<xref ref-type="bibr" rid="b31-sensors-11-07908">31</xref>]. We employ the DVP algorithm here for the Cell BE system and suggest enhancements that balance the workload across the heterogeneous multi-core processor.</p>
<p><xref ref-type="fig" rid="f5-sensors-11-07908">Figure 5</xref> presents the specific operations of progressive decoding, from <italic>Stage A</italic> to <italic>Stage E</italic>, and <xref ref-type="table" rid="t1-sensors-11-07908">Table 1</xref>, which was introduced in [<xref ref-type="bibr" rid="b28-sensors-11-07908">28</xref>], describes the operations and gives the percentage of each operation step. <xref ref-type="fig" rid="f5-sensors-11-07908">Figure 5</xref> depicts the decoding process just after the operations on the (<italic>k</italic> − 1)<italic><sup>th</sup></italic> row have finished and the <italic>k</italic><italic><sup>th</sup></italic> row has arrived at the destination node.</p>
<p><xref ref-type="fig" rid="f5-sensors-11-07908">Figure 5(a)</xref> depicts the operations at <italic>Stage A</italic>; the decoding process begins in the second figure within <xref ref-type="fig" rid="f5-sensors-11-07908">Figure 5(a)</xref>. At the beginning, the first row is multiplied with the first element of the arriving row, and the resulting row is subtracted from the arriving row. The same operations are performed for the second row; it is multiplied with the second element of the arriving row and the resultant row is subtracted from the arriving row. These operations are continued until all leading values are reduced to “0”.</p>
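<p>One progressive decoding step can be sketched as follows: Stage A as described above, followed by Stages B to E, which the next paragraph details. This is a sequential illustration under stated assumptions rather than the parallel implementation evaluated in this paper: the field is assumed to be GF(2<sup>8</sup>) with the reduction polynomial 0x11B, pivot columns are tracked explicitly so that rows may arrive in any order, and the names <monospace>ProgressiveDecoder</monospace>, <monospace>gf_mul</monospace>, and <monospace>gf_inv</monospace> are hypothetical.</p>
<preformat><![CDATA[
```python
def gf_mul(x, y, poly=0x11B):
    """Multiply two bytes in GF(2^8); the polynomial 0x11B is an assumption."""
    result = 0
    while y:
        if y & 1:
            result ^= x
        x <<= 1
        if x & 0x100:
            x ^= poly
        y >>= 1
    return result


def gf_inv(x):
    """Multiplicative inverse as x^254 (valid for x != 0)."""
    result = 1
    for _ in range(254):
        result = gf_mul(result, x)
    return result


class ProgressiveDecoder:
    def __init__(self, n):
        self.n = n
        self.rows = []     # reduced rows, each [coefficients | coded data]
        self.pivots = []   # pivot column of each stored row

    def add(self, coeffs, data):
        """Process one arriving transfer unit; return False if it is dependent."""
        row = list(coeffs) + list(data)
        # Stage A: cancel the arriving row's entries in existing pivot columns.
        for stored, col in zip(self.rows, self.pivots):
            factor = row[col]
            if factor:
                row = [v ^ gf_mul(factor, w) for v, w in zip(row, stored)]
        # Stage B: locate the first non-zero coefficient.
        pivot = next((c for c in range(self.n) if row[c]), None)
        # Stage C: an all-zero coefficient part means linear dependence.
        if pivot is None:
            return False
        # Stage D: divide the row by its pivot element.
        inv = gf_inv(row[pivot])
        row = [gf_mul(inv, v) for v in row]
        # Stage E: reduce the pivot column of all previous rows to zero.
        for i, stored in enumerate(self.rows):
            factor = stored[pivot]
            if factor:
                self.rows[i] = [v ^ gf_mul(factor, w)
                                for v, w in zip(stored, row)]
        self.rows.append(row)
        self.pivots.append(pivot)
        return True

    def result(self):
        """Return the original blocks once n independent rows have arrived."""
        if len(self.rows) != self.n:
            return None
        ordered = sorted(zip(self.pivots, self.rows))
        return [bytes(r[self.n:]) for _, r in ordered]
```
]]></preformat>
<p>Each call to <monospace>add</monospace> touches at most the rows received so far, which is why the work per arriving row stays quadratic in <italic>n</italic> instead of cubic.</p>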
<p>After the operations of <italic>Stage A</italic> are finished, the next decoding process identifies the first non-zero coefficient element (<italic>Stage B</italic>). It then determines whether the new row is linearly independent of the previously received rows (<italic>Stage C</italic>). The newly arriving row is then divided by its first non-zero element, referred to as the <italic>pivot</italic> (<italic>Stage D</italic>), as shown in <xref ref-type="fig" rid="f5-sensors-11-07908">Figure 5(b)</xref>. Finally, we reduce the values of this same column across all previous rows to “0” (<italic>Stage E</italic>), as depicted in <xref ref-type="fig" rid="f5-sensors-11-07908">Figure 5(c)</xref>.</p></sec></sec>
<sec>
<label>3.</label>
<title>Load Distribution and Progressive Decoding on Cell BE</title>
<p>In the network coding research conducted previously by Shojania and Li [<xref ref-type="bibr" rid="b28-sensors-11-07908">28</xref>], computational effort is statically distributed amongst multiple threads. However, as the size of the coefficient matrix increases, dynamically distributed computation has the potential to improve performance through well-balanced load distribution, as demonstrated in our previous research [<xref ref-type="bibr" rid="b31-sensors-11-07908">31</xref>]. In this section, we first introduce the three previously proposed algorithms, which are tested on the Cell BE system; in addition, we develop a new algorithm for use on the heterogeneous Cell BE processor that takes load balance into account.</p>
<p>In the previous work [<xref ref-type="bibr" rid="b31-sensors-11-07908">31</xref>], three types of partitioning algorithms were proposed: Horizontal Partitioning (HP), Row by Row Partitioning (RRP), and Dynamic Vertical Partitioning (DVP); the three approaches are presented in <xref ref-type="fig" rid="f6-sensors-11-07908">Figure 6</xref>. In this figure, each algorithm reflects the relevant operation in <italic>Stage E</italic> when the fourth row is received and subsequently parallelized into two threads. Both HP and RRP divide the workload on a row-by-row basis. However, HP divides rows between threads in a sequential manner, while RRP divides them using a round-robin approach. DVP divides the workload vertically and only takes the computational region into consideration. To implement these three algorithms on the Cell BE processor, we use the SPEs to decode and the PPE to manage the SPE threads, to handle the synchronization, and to decode part of the data, which is distributed with load balancing between the asymmetric cores taken into account.</p>
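<p>The three partitioning schemes can be sketched as index assignments. This is an illustration only: the function names are hypothetical, and for DVP the sketch assumes that the computational region is a contiguous range of columns that is split evenly and recomputed for each arriving row, which may differ from the details of the original algorithm.</p>
<preformat><![CDATA[
```python
def horizontal(num_rows, threads):
    """HP: give each thread a contiguous run of rows."""
    base, extra = divmod(num_rows, threads)
    parts, start = [], 0
    for t in range(threads):
        size = base + (1 if t < extra else 0)
        parts.append(list(range(start, start + size)))
        start += size
    return parts


def row_by_row(num_rows, threads):
    """RRP: assign row i to thread i mod threads (round-robin)."""
    return [list(range(t, num_rows, threads)) for t in range(threads)]


def dynamic_vertical(first_col, last_col, threads):
    """DVP (assumed form): split the columns [first_col, last_col) that still
    need updating into contiguous, near-equal ranges, one per thread.
    Every thread then processes all rows, but only within its column range."""
    base, extra = divmod(last_col - first_col, threads)
    parts, start = [], first_col
    for t in range(threads):
        size = base + (1 if t < extra else 0)
        parts.append((start, start + size))
        start += size
    return parts
```
]]></preformat>
<p>With three previously decoded rows and two threads, as in <xref ref-type="fig" rid="f6-sensors-11-07908">Figure 6</xref>, HP and RRP leave one thread with twice the rows of the other, whereas a column split can be kept even regardless of how many rows have arrived.</p>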
<sec>
<label>3.1.</label>
<title>Synchronization on the Cell BE</title>
<p>For an efficient decoding operation, we first distribute the computational region as shown in <xref ref-type="fig" rid="f7-sensors-11-07908">Figure 7</xref>. The Cell processor provided in the PlayStation 3, which is our experimental platform, is configured with two of the eight SPE cores disabled; therefore, we can only use seven programmable cores: one PPE core and six SPE cores. As the PPE has dual-threaded, dual-issue hardware, it runs two threads simultaneously. Unlike the PPE, each SPE can manage only one thread. As indicated by the thread distribution method detailed in <xref ref-type="fig" rid="f7-sensors-11-07908">Figure 7</xref>, PPE <italic>thread 1</italic> manages the pivot column’s elements and must transfer them to the SPE threads before processing a newly received row. In addition, the Cell BE has a communication system called <italic>mailbox</italic> which delivers 32-bit data between the cores [<xref ref-type="bibr" rid="b32-sensors-11-07908">32</xref>,<xref ref-type="bibr" rid="b33-sensors-11-07908">33</xref>]. We use the mailbox system both to synchronize the threads and to transfer the elements.</p>
<p>The mailbox system is provided for each SPE and is implemented in an asymmetric manner; both the <italic>inbound</italic> and <italic>outbound</italic> mailboxes are contained in each SPE and messages are transmitted to the MFC from the SPE via the EIB. The mailboxes have one outbound entry and four inbound entries. At each SPE, a 32-bit inbound mail is read by the SPE and an outbound mail is sent by the SPE. A read operation on the SPE stalls when the inbound mailbox is empty and resumes as soon as a new message becomes available. Likewise, a write operation stalls when the outbound mailbox entry is full.</p>
<p>Synchronization can be achieved by using the outbound mailbox entry in the following manner. Each SPE writes a mail to the outbound entry and continuously checks whether the PPE has read the mail and emptied the outbound entry. Once the PPE has received the mails from all SPEs, synchronization is achieved. This also implies that the PPE must wait until it has received all the mails. After synchronization is guaranteed, each SPE waits for a reply from the PPE which contains a pivot element.</p>
<p>As an alternative, we propose to use only the inbound mailbox for synchronization, which provides better performance with a simpler implementation. The PPE transfers the pivot element to the inbound mailbox entry and the SPE continuously checks until the pivot element has been completely transferred. In this way, synchronization messages from the SPE side are no longer necessary. This is possible because a read operation that stalls on an empty inbound entry can itself serve as the synchronization point.</p></sec>
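<p>The stall conditions described above can be illustrated with a small software model. The sketch below is not the Cell SDK interface; the type <italic>mailbox_t</italic> and the functions <italic>mbox_init</italic>, <italic>mbox_try_write</italic>, and <italic>mbox_try_read</italic> are hypothetical names. It only models the entry counts given in the text (four inbound, one outbound), and instead of stalling it returns 0 at the points where the hardware operation would stall.</p>
<preformat><![CDATA[
```c
#include <assert.h>
#include <stdint.h>

enum { INBOUND_ENTRIES = 4, OUTBOUND_ENTRIES = 1 };

/* Software model of one mailbox direction: a small FIFO of 32-bit messages. */
typedef struct {
    uint32_t slot[INBOUND_ENTRIES]; /* large enough for either direction */
    unsigned capacity;              /* 4 for inbound, 1 for outbound */
    unsigned count;                 /* messages currently queued */
    unsigned head;                  /* index of the oldest message */
} mailbox_t;

void mbox_init(mailbox_t *m, unsigned capacity)
{
    assert(capacity >= 1 && capacity <= (unsigned)INBOUND_ENTRIES);
    m->capacity = capacity;
    m->count = 0;
    m->head = 0;
}

/* Returns 1 on success, 0 if the mailbox is full
 * (the point where a blocking write would stall). */
int mbox_try_write(mailbox_t *m, uint32_t msg)
{
    if (m->count == m->capacity)
        return 0;
    m->slot[(m->head + m->count) % m->capacity] = msg;
    m->count++;
    return 1;
}

/* Returns 1 and stores the oldest message in *msg, or 0 if the mailbox
 * is empty (the point where a blocking read would stall). */
int mbox_try_read(mailbox_t *m, uint32_t *msg)
{
    if (m->count == 0)
        return 0;
    *msg = m->slot[m->head];
    m->head = (m->head + 1) % m->capacity;
    m->count--;
    return 1;
}
```
]]></preformat>
<p>In this model, a fifth write to an inbound mailbox and a second write to an outbound mailbox both fail until a message has been read, which corresponds to the stalls described above.</p>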
<sec>
<label>3.2.</label>
<title>Considering Load Balancing Effects on Cell BE</title>
<p>In this subsection, we propose our approach, which enables an optimized workload distribution on the Cell BE. <xref ref-type="fig" rid="f7-sensors-11-07908">Figure 7</xref> depicts the computational area required to process and to dynamically distribute the workload. The previous work has already considered load balancing on general, homogeneous processors (e.g., Intel or AMD multi-core processors); however, the Cell BE is a heterogeneous processor which has an asymmetric core architecture. The SPEs have been designed to deliver higher computational power than the PPE, especially with the SIMD instruction set. As such, we must consider the difference between these two types of cores in order to achieve proper workload distribution.</p>
<p>For that purpose, we first define a value called <italic>ppefactor</italic>, which determines the workload distribution ratio of the PPE <italic>versus</italic> the SPE. For example, when ppefactor is set to 0.1, the PPE takes 10% of the available work, and the remainder is assigned to the SPE. Before load balancing on the asymmetric core architecture is considered, the cores receive equally divided workloads. Finding the optimal workload distribution in the proposed approach depends strongly on heuristics.</p>
<p>Once the workload distribution over the PPE and the SPEs is defined, the data partitioning for the SIMD instructions should be defined. Although the data computation region is dynamically partitioned by DVP, architectural optimization can be achieved because the Cell BE processor supports the SIMD instruction set. The SIMD instructions for the PPE and the SPEs enable 128-bit operations. For that reason, the data are divided into chunks of 16 bytes each. When the size of the data is not a multiple of 16, the remainder is assigned to the PPE. For example, when the data size is 117 bytes, the chunks are constructed starting from the rightmost element in the data (the rightmost column). This means that there are 7 chunks and 5 remaining bytes, which are the leftmost 5 bytes. The five bytes are then assigned to one of the PPE threads and the 7 chunks are assigned to the other PPE thread and the SPEs. This method is superior to the method in which each core has an equal number of elements.</p>
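<p>The chunking arithmetic in this example can be written out as a short C sketch. The names <italic>chunk_layout_t</italic>, <italic>split_into_chunks</italic>, and <italic>chunk_offset</italic> are hypothetical and only illustrate the 16-byte alignment from the right; how the resulting chunks are then shared among the threads is not modeled here.</p>
<preformat><![CDATA[
```c
#include <assert.h>
#include <stddef.h>

enum { CHUNK_BYTES = 16 }; /* one 128-bit SIMD register */

typedef struct {
    size_t n_chunks; /* number of full 16-byte chunks */
    size_t leftover; /* leftmost bytes that do not fill a chunk */
} chunk_layout_t;

/* Chunks are built from the rightmost byte, so any remainder
 * ends up at the left edge of the data. */
chunk_layout_t split_into_chunks(size_t data_bytes)
{
    chunk_layout_t layout;
    layout.n_chunks = data_bytes / CHUNK_BYTES;
    layout.leftover = data_bytes % CHUNK_BYTES;
    return layout;
}

/* Byte offset of chunk k, counted from the left.
 * The leftover bytes occupy offsets [0, leftover). */
size_t chunk_offset(chunk_layout_t layout, size_t k)
{
    assert(k < layout.n_chunks);
    return layout.leftover + k * CHUNK_BYTES;
}
```
]]></preformat>
<p>For 117 bytes this yields 7 chunks and 5 leftover bytes; the first chunk starts at byte offset 5 and the last one at offset 101, so it ends exactly at byte 117.</p>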
<p>After addressing the workload distribution on each thread, we need to select a proper computation method between the table-based approach and the loop-based approach for Galois field multiplications. We now explain these two approaches in Section 3.3 and the selected method is then fully tested in Section 4.2.</p></sec>
<sec>
<label>3.3.</label>
<title>Galois Field Operation for SIMD</title>
<p>Random linear network coding operates on Galois field numbers and incurs computation overhead due to the time-consuming multiplication operations. In this subsection, we propose an optimization of the Galois field operation that builds on a technique previously proposed for GPUs [<xref ref-type="bibr" rid="b34-sensors-11-07908">34</xref>].</p>
<p>Increasing the granularity of the Galois multiplication is difficult when using a table look-up method [<xref ref-type="bibr" rid="b34-sensors-11-07908">34</xref>]. As the size of the Galois field increases, the memory requirement grows rapidly. In fact, increasing the granularity of the Galois field by 1 byte means a table which is 256 times larger. This requires more cache and memory space. Furthermore, the SPEs do not have caches; they only have a 256 KB SRAM, referred to as the local store. Therefore, an SPE either cannot hold large tables at all or must give up a large part of its local store to hold them.</p>
<p>To provide sufficient granularity of the multiplications, Shojania <italic>et al</italic>. imported a loop-based approach which is based on the actual computations. Although the loop-based approach needs more computations than the table lookup method, it provides a faster computation time with the help of the SIMD instruction sets [<xref ref-type="bibr" rid="b28-sensors-11-07908">28</xref>,<xref ref-type="bibr" rid="b34-sensors-11-07908">34</xref>–<xref ref-type="bibr" rid="b36-sensors-11-07908">36</xref>].</p>
<p>In the previous work [<xref ref-type="bibr" rid="b34-sensors-11-07908">34</xref>], Shojania <italic>et al</italic>. suggested a word-wide multiplication method over Rijndael’s finite field [<xref ref-type="bibr" rid="b37-sensors-11-07908">37</xref>,<xref ref-type="bibr" rid="b38-sensors-11-07908">38</xref>]. The method can perform four multiplications of Galois field numbers at once. The Galois field numbers are one byte in size and the field is denoted as GF(2<sup>8</sup>).</p>
<p>They successfully applied the loop-based multiplication, depicted in <xref ref-type="fig" rid="f8-sensors-11-07908">Figure 8</xref>, on the multiple scalar processors of a GPU. The key optimization in that work is to eliminate branch operations by using polynomial mask operations. This helps to improve the performance of the division operation with an irreducible polynomial. As a result, the execution time is reduced.</p>
<p>Although they highly optimized the loop-based multiplication method by reducing the divergence of control flow on branch instructions, one more branch instruction can still be removed. Since a branch instruction causes stalls within a pipeline, any branch instruction in a loop severely degrades the performance of the Galois field multiplication; in turn, the speed of the Galois field multiplication strongly affects the performance of network coding. For that reason, we aim to remove the remaining branch within the loop, marked as (3) in <xref ref-type="fig" rid="f8-sensors-11-07908">Figure 8</xref>. This branch operation can also be replaced with bitwise operations when the multiplication is mapped onto the SIMD instructions. In addition, the replacement adds fewer than five instructions. On the other hand, if the branch instruction in line (2) is replaced with bitwise operations, a significant number of additional instructions must be executed in the loop when the factor is zero. For that reason, the branch instruction in line (2) should remain for the sake of overall performance.</p>
<p>The proposed Galois field multiplication based on the SIMD instruction set is shown in <xref ref-type="fig" rid="f9-sensors-11-07908">Figure 9</xref>. The main difference compared to the previous code in <xref ref-type="fig" rid="f8-sensors-11-07908">Figure 8</xref> can be found in (4) to (6). The branch operation is replaced with the masking operation in <italic>ResultMask</italic> and the execution condition of the branch is calculated with a <italic>vector_cmpeq</italic>, which is generally included in SIMD instruction sets. A <italic>vector_cmpeq</italic> operation checks whether each element in a vector is identical to the corresponding element of the other vector. After comparing each pair of elements, the operation sets all bits of an element to 1 when the two elements are identical. Therefore, if the condition is true, the result becomes the XOR-ed data (<xref ref-type="fig" rid="f9-sensors-11-07908">Figure 9(a)</xref>). Otherwise, the result is not changed (<xref ref-type="fig" rid="f9-sensors-11-07908">Figure 9(b)</xref>).</p>
<p>An SPE processes 80 KB within 211 <italic>μ</italic>s with the original code. After the modification, an SPE finishes the same operation within 200 <italic>μ</italic>s. The optimization thus improves performance by 5.5% (211/200 ≈ 1.055).</p></sec></sec>
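<p>The masking idea can be sketched in portable C. The code below is not the code of Figures 8 and 9 and is written under several assumptions: the field is GF(2<sup>8</sup>) with the Rijndael polynomial (0x11B), four field elements are packed into one 32-bit word instead of a 128-bit SIMD register, and the all-ones mask is formed arithmetically rather than with a vector compare. The function names <italic>gf256_mul</italic> and <italic>gf256_mul_packed</italic> are hypothetical.</p>
<preformat><![CDATA[
```c
#include <assert.h>
#include <stdint.h>

/* Reference version: loop-based multiply of two bytes in GF(2^8),
 * reduced modulo x^8 + x^4 + x^3 + x + 1 (0x11B), using branches. */
uint8_t gf256_mul(uint8_t a, uint8_t b)
{
    uint8_t result = 0;
    while (b != 0) {
        if (b & 1u)
            result ^= a;            /* add a when the low bit of b is set */
        uint8_t carry = a & 0x80u;
        a = (uint8_t)(a << 1);      /* multiply a by x */
        if (carry)
            a ^= 0x1Bu;             /* reduce by the field polynomial */
        b >>= 1;
    }
    return result;
}

/* Packed version: multiplies four bytes held in `data` by one coefficient.
 * Only the early-exit test on the factor remains as a branch; the
 * conditional XOR and the reduction are both done with masks. */
uint32_t gf256_mul_packed(uint32_t data, uint8_t factor)
{
    uint32_t result = 0;
    while (factor != 0) {
        /* All ones if the low bit of the factor is set, otherwise zero. */
        uint32_t result_mask = 0u - (uint32_t)(factor & 1u);
        result ^= data & result_mask;

        uint32_t high = data & 0x80808080u;  /* bytes that will overflow */
        data = (data << 1) & 0xFEFEFEFEu;    /* per-byte multiply by x */
        data ^= (high >> 7) * 0x1Bu;         /* per-byte reduction */
        factor >>= 1;
    }
    return result;
}
```
]]></preformat>
<p>As a check, both functions reproduce the standard Rijndael example 0x57 × 0x83 = 0xC1, with the packed version computing it in all four byte lanes at once.</p>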
<sec sec-type="methods|results">
<label>4.</label>
<title>Experimental Results and Analysis</title>
<p>In this section, we first evaluate the previous parallelized network coding algorithm developed for homogeneous multi-core processors on the Cell BE; we simply translate the previous approach to the SIMD instruction set of the Cell BE. Then, we compare the multiplication methods: table-based, loop-based, and SIMD-based multiplication. Further, we compare the parallelized decoding performance obtained by applying the specific multiplication methods on the PPE and the SPEs. We also evaluate the partitioning of the PPE workload, applying the three multiplication methods adaptively using <italic>ppefactor</italic>. Finally, we evaluate our parallelized progressive decoding method on the Cell BE and compare it to commercially available homogeneous multi-core systems, such as Intel and AMD quad-cores. The specifications of the evaluation environments are described in <xref ref-type="table" rid="t2-sensors-11-07908">Table 2</xref>.</p>
<sec>
<label>4.1.</label>
<title>Implementation of Previous Work</title>
<p>We evaluate the algorithms previously proposed for homogeneous multi-core processors, HP, RRP, and DVP, on the Cell BE architecture. First, these algorithms are implemented using only the SPEs and the SIMD instruction set of the SPE. <xref ref-type="fig" rid="f10-sensors-11-07908">Figure 10</xref> presents the execution time of the decoding operation that was discussed in Section 2.4. Across all coefficient matrix sizes, HP and RRP exhibit similar performance. In contrast, DVP exhibits better performance. As the SPEs decode the data without a data cache, the dissimilarity between the HP and RRP algorithms does not affect the required decoding time.</p>
<p>The maximum performance difference between HP and RRP is only 1.69% and, on average, there is only a 0.04% difference. In addition, DVP shows a maximum 31% enhancement over HP and RRP. Therefore, in the next section, we perform the remaining experiments using DVP. As on the homogeneous multi-core processor, the advantage of DVP in terms of load balancing brings better results. A detailed explanation of DVP can be found in [<xref ref-type="bibr" rid="b31-sensors-11-07908">31</xref>].</p>
<p>The difference between Horizontal Partitioning (HP) and Row by Row Partitioning (RRP) comes from the different manner in which rows are distributed to the different cores in <italic>Stage E</italic>. The results can be explained by the presence (or lack thereof) of a data cache. The Cell BE does not have a data cache on its SPEs. Therefore, there is no distinctive difference between the two algorithms when implemented on this architecture. In other words, the heterogeneous processor, which has a simplified memory hierarchy for fast access to local memory, cannot benefit from horizontal partitioning, even though it is a distinct and well-balanced approach for cache-embedded systems.</p></sec>
<sec>
<label>4.2.</label>
<title>Computation Time on Galois Field</title>
<p>In this subsection, we evaluate the decoding performance of each Galois field multiplication method. For the analysis, we choose to use the 128 bit SIMD instruction set to parallelize the Galois field multiplications.</p>
<p>Let <italic>COMPUTE</italic> represent the loop-based algorithm, <italic>TL</italic> the table-based algorithm, and <italic>VECTOR</italic> the parallelized SIMD implementation of the loop-based algorithm, for the Galois field operations. We estimate the performance of the three multiplication methods on real machines: an Intel <italic>Core 2 quad Q9400</italic>, an AMD <italic>Phenom-X4 9550</italic>, and the Cell BE, all of which are described in <xref ref-type="table" rid="t2-sensors-11-07908">Table 2</xref>.</p>
<p><xref ref-type="fig" rid="f11-sensors-11-07908">Figure 11</xref> presents the normalized performance of the <italic>TL</italic> and <italic>VECTOR</italic> methods over the <italic>COMPUTE</italic> method, on each type of core. All the cores display speed-up factors greater than 1 compared to the <italic>COMPUTE</italic> algorithm. <italic>COMPUTE</italic> incurs greater overhead than <italic>TL</italic>, so <italic>TL</italic> is expected to be faster than <italic>COMPUTE</italic>. In addition, <italic>VECTOR</italic>, a parallelized method using SIMD instructions, is faster than all the other methods because it processes 128-bit multiplications in parallel. The speed advantages on the PPE and SPE, obtained when using the <italic>VECTOR</italic> algorithm, are significant and noticeable.</p>
<p>In particular, the <italic>VECTOR</italic> algorithm executed on the PPE shows a speed increase by a factor of 7.71. Although the PPE incorporates a data cache, just as other generic processors do, its L1 data cache is less than half the size of theirs. In addition, its L2 cache is much smaller than that of the other general purpose processors. Therefore, the <italic>VECTOR</italic> algorithm exhibits a greater speed-up than on other generic processors because it strongly depends on the computation capability of the SIMD execution unit. The SPE, on the other hand, has no data cache and merely has a high-bandwidth embedded SRAM, referred to as the local store. However, it shows speed-up results similar to those of the other processors since its local store is as fast as the data caches.</p>
<p>In <xref ref-type="fig" rid="f12-sensors-11-07908">Figure 12</xref>, we present the speed increase exhibited by <italic>TL</italic> and <italic>VECTOR</italic> with respect to <italic>COMPUTE</italic> in actual decoding, using each method on the Cell BE architecture. Each method using DVP (PPE with 2 threads and SPE with 6 threads) is evaluated on the different core architectures, with data sizes varying between 16 KB and 1 MB, on a coefficient matrix of size 128. In real decoding processes, the speed increase of <italic>TL</italic> is higher on the PPE, but lower on the SPE, compared with the results in <xref ref-type="fig" rid="f11-sensors-11-07908">Figure 11</xref>. <italic>VECTOR</italic>, on the other hand, shows lower performance. As <italic>TL</italic> depends on cache performance rather than computing power, and the performance of the entire decoding process is affected by the cache, <italic>TL</italic> shows improved performance on the PPE. However, in the absence of a data cache, the SPE shows lower performance using <italic>TL</italic>. Furthermore, <italic>VECTOR</italic> requires computing power, and the entire decoding process has additional overhead compared to a single multiplication. Therefore, it shows lower speed-ups than the results in <xref ref-type="fig" rid="f11-sensors-11-07908">Figure 11</xref>.</p>
<p>Despite its low performance on small data sizes, the SPE shows similar speed-ups when the data size becomes large. Since the SPE must be controlled by the PPE to synchronize the decoding process between SPEs and to transfer data from main memory, the SPE shows lower performance with small data sizes, where the synchronization and data transfer overhead accounts for a large proportion of the time.</p>
<p>From the results in <xref ref-type="fig" rid="f12-sensors-11-07908">Figure 12</xref>, we intuitively find the parallelized SIMD multiplication is the optimal solution to achieve high-performance decoding.</p></sec>
<sec>
<label>4.3.</label>
<title>Synchronization with Mailbox System</title>
<p>In Section 3.1, we introduced an efficient way to implement synchronization with the asymmetric mailbox system. With the inbound mailbox, the cores can synchronize at each decoding step and share the values in the pivot column at the same time. We compare the decoding speed of the two synchronization methods, based on the inbound mailbox and the outbound mailbox respectively, in <xref ref-type="fig" rid="f13-sensors-11-07908">Figure 13</xref>. For the three kinds of computation approaches, COMPUTE, TL, and VECTOR, the decoding procedure is tested while varying the synchronization method, with a simply divided workload for each thread.</p>
<p>In the experimental results, the synchronization method that combines synchronization with the data transfer reduces decoding time by more than 10%. COMPUTE and TL show remarkably reduced decoding times since these methods already suffer severe synchronization overhead from an unfair workload distribution that does not consider the computation capability of the different types of cores. Consequently, synchronization with the inbound mailbox reduces both the performance degradation caused by inefficient synchronization and the degradation caused by the absence of proper workload distribution. The performance improvement from a well-balanced workload is tested in the next subsection.</p></sec>
<sec>
<label>4.4.</label>
<title>Partitioning on PPE</title>
<p>In Section 3.2, we explained the different factors that must be considered in determining workload distribution for the PPE and the SPEs. We have examined three multiplication methods on the PPE and compared the result of each method to the performance achieved when utilizing only the SPEs. The performance results are depicted in <xref ref-type="fig" rid="f13-sensors-11-07908">Figure 13</xref>. The workload dedicated to the PPE is as large as the amount assigned to one SPE. We employ parallelized multiplication using the SIMD instruction set on the SPEs, rather than table-based multiplication, which is better suited to processors that have a local cache. Even though we also use the PPE in decoding, <xref ref-type="fig" rid="f14-sensors-11-07908">Figure 14</xref> shows lower speed increases than those observed when only using the SPEs (with the exception of <italic>PPE_VECTOR</italic>).</p>
<p>In this section, we propose an approach to the factorization of workload between cores, and we evaluate the decoding time when varying distribution factor, which we refer to as <italic>ppefactor</italic>. Then, we use the configuration with equally divided workload distribution at each core as a performance baseline.</p>
<p><xref ref-type="fig" rid="f15-sensors-11-07908">Figure 15</xref> presents the average speed-ups observed when varying the <italic>ppefactor</italic> for the <italic>PPE_COMPUTE</italic>, <italic>PPE_TL</italic>, and <italic>PPE_VECTOR</italic> algorithms. <italic>PPE_VECTOR</italic> is a unified parallel algorithm that uses parallelized SIMD multiplication on both the PPE and the SPEs. In contrast, <italic>PPE_COMPUTE</italic> and <italic>PPE_TL</italic> are hybrid parallel algorithms. They employ computation-based and table-based multiplication, respectively, on the PPE and use only parallel SIMD multiplication on the SPEs.</p>
<p>In order to parallelize the progressive decoding algorithm across multiple cores, it is necessary to have a synchronization barrier that blocks excessive progression by any one particular thread. Synchronization employing a barrier greatly decreases performance when load balancing results in uneven distribution between cores. Thus, we evaluate the sensitivity of the three algorithms with respect to <italic>ppefactor</italic>. We do this because it will be necessary to dynamically redistribute the workload to all threads in an efficient manner, from a performance perspective.</p>
<p><xref ref-type="fig" rid="f16-sensors-11-07908">Figure 16</xref> depicts the measured average increase in speed observed when we decode data sizes from 16 KB to 1 MB, with a coefficient matrix varying in size from 64 to 512, with optimal values of <italic>ppefactor</italic>. For small data sizes, since the portion assigned to the PPE is smaller than the portion assigned to the SPE, varying the factor does not significantly affect the performance. However, if the data size is large relative to the coefficient matrix size, high speed-ups are obtained.</p>
<p>In <xref ref-type="fig" rid="f15-sensors-11-07908">Figure 15(a–c)</xref>, we can identify the most relevant local maximum values associated with each method, which are listed in <xref ref-type="table" rid="t3-sensors-11-07908">Table 3</xref>. As <xref ref-type="table" rid="t3-sensors-11-07908">Table 3</xref> shows, we achieve a performance increase of 8% with a <italic>ppefactor</italic> of 2.38 even with <italic>PPE_VECTOR</italic>. This means that the PPE is assigned a greater workload than the SPE. With this fine tuning of the workload distribution, parallelization using the SIMD instruction set results in high performance on the Cell BE.</p></sec>
<sec>
<label>4.5.</label>
<title>Overall Decoding Performance</title>
<p>We compare the performance results of our factorized parallelization to the results obtained using a <italic>ppefactor</italic> of 1 in <xref ref-type="fig" rid="f16-sensors-11-07908">Figure 16</xref>. The figure compares the performance before and after the factorization of the PPE, with coefficient matrices varying in size from 64 to 512. After identifying the optimal <italic>ppefactor</italic>, we obtain a speed increase of more than 1.5 using <italic>PPE_COMPUTE</italic>. On the other hand, <italic>PPE_VECTOR</italic> and <italic>PPE_TL</italic> exhibit negligible speed increases. These results are summarized in <xref ref-type="table" rid="t3-sensors-11-07908">Table 3</xref>. They indicate that factorization is an important consideration when we decompose and rearrange tasks on heterogeneous multi-core processors.</p>
<p><xref ref-type="fig" rid="f17-sensors-11-07908">Figure 17</xref> presents observed decoding times with coefficient matrices of varying sizes, from 64 to 512, when decoding different volumes of data, between 16 KB and 1 MB. We have shown above, in Section 4.2, that the parallelized Galois field multiplications using the SIMD instruction is the fastest implementation method on a homogeneous multi-core processor. In order to ensure a legitimate comparison with our implementations on the Cell BE, we implemented network coding on the Intel and AMD quad-core processors using only SIMD instructions. We have compared computing-based (<italic>PPE_COMPUTE</italic>), table-based (<italic>PPE_TL</italic>), and SIMD-based (<italic>PPE_VECTOR</italic>) multiplication methods on the PPE to the SPE using SIMD-based multiplication. In addition, the implementations of <italic>PPE_COMPUTE</italic>, <italic>PPE_TL</italic>, and <italic>PPE_VECTOR</italic> exhibit average increases in speed of 0.32, 0.88, and 2.38, respectively, under experimental evaluation, as noted in Section 4.4. All implementations are compiled with the O3 level of the GNU GCC.</p>
<p>In <xref ref-type="fig" rid="f17-sensors-11-07908">Figure 17</xref>, it can be seen that <italic>PPE_COMPUTE</italic> demonstrates a low decoding speed when dealing with small data; however, it performs in a manner comparable to homogeneous processors as data size increases. This is because it incurs delay when the PPE is forced to wait for the SPEs during the decoding operation. In contrast, the other multiplication methods on the Cell BE, which use table-based or parallelized SIMD-based multiplication on the PPE and parallelized SIMD-based multiplication on the SPE, exhibit fast decoding times in all experimental ranges. This gap increases with data size, as the gains from parallelization are enhanced.</p>
<p><xref ref-type="fig" rid="f18-sensors-11-07908">Figure 18</xref> shows the average speed-ups as the data size varies, for all coefficient sizes: 64, 128, 256, and 512. It shows that the speed-ups improve with the data size; as the amount of computation increases, more of the data transmission from main memory to the SPEs can be hidden. As shown in <xref ref-type="fig" rid="f17-sensors-11-07908">Figures 17</xref> and <xref ref-type="fig" rid="f18-sensors-11-07908">18</xref>, the Cell BE is efficient for network coding with large data sizes, especially when we use parallelized SIMD instructions.</p></sec></sec>
<sec>
<label>5.</label>
<title>Related Work</title>
<p>Ahlswede <italic>et al</italic>. were the first to introduce network coding and demonstrate its usefulness [<xref ref-type="bibr" rid="b1-sensors-11-07908">1</xref>]. After this initial work, the maximum theoretical throughput of network coding was proven, and achieved, using linear network codes, by Koetter and Medard [<xref ref-type="bibr" rid="b39-sensors-11-07908">39</xref>]. As suggested by Chou <italic>et al</italic>. [<xref ref-type="bibr" rid="b27-sensors-11-07908">27</xref>] and Ho <italic>et al</italic>. [<xref ref-type="bibr" rid="b40-sensors-11-07908">40</xref>], our implementations employ random linear network coding, which is believed to be the most practical approach to multicast flow scenarios, as the target to parallelize. Network coding research then spread to wireless network systems after its utility had been demonstrated by Lun <italic>et al</italic>. [<xref ref-type="bibr" rid="b41-sensors-11-07908">41</xref>] in that context. Katti <italic>et al</italic>. proposed a number of practical solutions using multiple unicast flows [<xref ref-type="bibr" rid="b42-sensors-11-07908">42</xref>] and Park <italic>et al</italic>. showed improvements in the reliability of ad hoc network systems [<xref ref-type="bibr" rid="b43-sensors-11-07908">43</xref>].</p>
<p>Applications of network coding have been proposed in [<xref ref-type="bibr" rid="b44-sensors-11-07908">44</xref>] and recent studies of feasibility in real testbeds have been performed and documented [<xref ref-type="bibr" rid="b45-sensors-11-07908">45</xref>]. In particular, several previous studies proposed using network coding techniques in wireless sensor networks [<xref ref-type="bibr" rid="b5-sensors-11-07908">5</xref>–<xref ref-type="bibr" rid="b10-sensors-11-07908">10</xref>]. Widmer and Le Boudec introduced a network coding based forwarding scheme for wireless sensor networks where nodes sleep most of the time [<xref ref-type="bibr" rid="b5-sensors-11-07908">5</xref>]. Al-Kofahi and Kamal address the problem of survivability of many-to-one flows in wireless sensor networks (WSN) using the network coding technique [<xref ref-type="bibr" rid="b6-sensors-11-07908">6</xref>]. In addition, Hou <italic>et al</italic>. proposed AdapCode, which is a reliable data dissemination protocol developed for any software update. Their proposed method relies on adaptive network coding to reduce broadcast traffic in the process of dissemination [<xref ref-type="bibr" rid="b9-sensors-11-07908">9</xref>]. The use of network coding in the design of practical health care wireless sensor networks is also presented in [<xref ref-type="bibr" rid="b10-sensors-11-07908">10</xref>]. The use of multi-core processors in the cloud computing environment has also been proposed [<xref ref-type="bibr" rid="b46-sensors-11-07908">46</xref>].</p>
<p>In addition, Lee <italic>et al</italic>. discussed the utility of network coding in mobile systems [<xref ref-type="bibr" rid="b47-sensors-11-07908">47</xref>]. Further, Gkantsidis <italic>et al</italic>. showed that smooth, fast downloads and efficient server utilization can be achieved using network coding [<xref ref-type="bibr" rid="b4-sensors-11-07908">4</xref>]. Lastly, Shojania and Li consider the adoption of network coding in practical applications in mobile networks with the Apple iPhone [<xref ref-type="bibr" rid="b48-sensors-11-07908">48</xref>].</p>
<p>Parallelized network coding was first suggested by Shojania and Li [<xref ref-type="bibr" rid="b28-sensors-11-07908">28</xref>]. The authors used hardware acceleration and proposed a multi-threaded design utilizing multi-core systems. Research has also been conducted, from a variety of perspectives, which focuses on reducing the computational complexity of encoding/decoding operations [<xref ref-type="bibr" rid="b49-sensors-11-07908">49</xref>,<xref ref-type="bibr" rid="b50-sensors-11-07908">50</xref>]. Park <italic>et al</italic>. suggested enhanced forms of parallelized network coding algorithms with reduced computational complexity [<xref ref-type="bibr" rid="b31-sensors-11-07908">31</xref>,<xref ref-type="bibr" rid="b51-sensors-11-07908">51</xref>]. In contrast, our work focuses on improving decoding performance by adapting algorithms for use on a heterogeneous processor, the Cell BE.</p>
<p>Many algorithms have been proposed to parallelize matrix calculation, such as the parallelization of matrix inversion [<xref ref-type="bibr" rid="b52-sensors-11-07908">52</xref>], parallel LU decomposition [<xref ref-type="bibr" rid="b29-sensors-11-07908">29</xref>], and parallelization of Gauss-Jordan elimination with block-based algorithms [<xref ref-type="bibr" rid="b30-sensors-11-07908">30</xref>]. However, due to the network transfer delay, Park <italic>et al</italic>. employ a more aggressive method of network coding, referred to as <italic>“progressive”</italic> decoding [<xref ref-type="bibr" rid="b28-sensors-11-07908">28</xref>].</p>
<p>Approaches to enhancing the performance of progressive decoding were proposed in <italic>Parallelized Progressive Network Coding</italic> [<xref ref-type="bibr" rid="b28-sensors-11-07908">28</xref>]. The approaches are based on the Gauss–Jordan elimination algorithm. A simple description of one variant of Gauss–Jordan elimination, as explained in [<xref ref-type="bibr" rid="b28-sensors-11-07908">28</xref>], is presented in <xref ref-type="table" rid="t1-sensors-11-07908">Table 1</xref> of this paper. Over the entire decoding process, <italic>Stage A</italic> and <italic>E</italic> comprise the majority of the workload; according to [<xref ref-type="bibr" rid="b28-sensors-11-07908">28</xref>], <italic>Stage A</italic> makes up 50.05% of the workload, while <italic>Stage E</italic> accounts for 49.5%.</p>
<p>The load-balancing problem has been emphasized in divisible load theory [<xref ref-type="bibr" rid="b21-sensors-11-07908">21</xref>–<xref ref-type="bibr" rid="b23-sensors-11-07908">23</xref>]. Drozdowski and Lawenda introduced a method of verifying divisible load size for heterogeneous distributed systems [<xref ref-type="bibr" rid="b53-sensors-11-07908">53</xref>]. Cariño <italic>et al</italic>. suggested a factoring method for dynamic load balancing [<xref ref-type="bibr" rid="b54-sensors-11-07908">54</xref>]. The usefulness of hardware acceleration on a GPGPU has been shown by Shojania <italic>et al</italic>. [<xref ref-type="bibr" rid="b34-sensors-11-07908">34</xref>] and Chu <italic>et al</italic>. [<xref ref-type="bibr" rid="b55-sensors-11-07908">55</xref>].</p></sec>
<sec sec-type="conclusions">
<label>6.</label>
<title>Conclusions</title>
<p>In this paper, we introduced an efficient random linear network coding algorithm with an appropriate load balancing method for a heterogeneous multi-core processor. We designed the proposed architecture specifically with the wireless sensor network environment in mind. Our algorithm combines a load balancing method with a hybrid progressive decoding algorithm that accounts for the different computing capabilities of the cores. We achieve the maximum speed-up by selectively using multiplication algorithms that are (1) table-based when dealing with small coefficient and data sizes and (2) parallelized with SIMD instructions when dealing with large coefficient sizes, as shown in <xref ref-type="fig" rid="f19-sensors-11-07908">Figure 19</xref>.</p>
<p>We compared the performance of the proposed approach with that of one of the fastest progressive decoding algorithms executed on homogeneous processors, and the comparison showed improved performance for our method. <xref ref-type="table" rid="t4-sensors-11-07908">Table 4</xref> lists the maximum and average speed-ups of network coding for various matrix sizes (64, 128, 256, and 512) relative to the homogeneous processors. Our proposed implementation shows improved performance in most of the experiments. We achieved a maximum speed-up of 2.19 at 1 MB of data with a coefficient matrix size of 64 compared to the Intel quad-core processor. In addition, we obtained a maximum speed-up of 3.12 at 128 KB of data with a coefficient matrix size of 64 compared to the AMD quad-core processor. The proposed method is especially efficient when dealing with large data sizes.</p></sec></body>
<back>
<ack>
<p>This work was supported by the Korea Research Foundation Grant funded by the Korean Government (KRF-2008-313-D00871).</p></ack>
<ref-list>
<title>References</title>
<ref id="b1-sensors-11-07908"><label>1.</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Ahlswede</surname><given-names>R</given-names></name><name><surname>Cai</surname><given-names>N</given-names></name><name><surname>Li</surname><given-names>S-YR</given-names></name><name><surname>Yeung</surname><given-names>RW</given-names></name></person-group><article-title>Network information flow</article-title><source>IEEE Trans. Inf. Theory</source><year>2000</year><volume>46</volume><fpage>1204</fpage><lpage>1216</lpage><pub-id pub-id-type="doi">10.1109/18.850663</pub-id></citation></ref>
<ref id="b2-sensors-11-07908"><label>2.</label><citation citation-type="confproc"><person-group person-group-type="author"><name><surname>Sanchez-Avila</surname><given-names>C</given-names></name><name><surname>Sanchez-Reillo</surname><given-names>R</given-names></name></person-group><article-title>The Rijndael Block Cipher (AES proposal): A comparison with DES</article-title><conf-name>Proceedings of 2001 IEEE the 35th International Carnahan Conference on Security Technology</conf-name><conf-loc>London, UK</conf-loc><conf-date>16–19 October 2001</conf-date><fpage>229</fpage><lpage>234</lpage></citation></ref>
<ref id="b3-sensors-11-07908"><label>3.</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Li</surname><given-names>B</given-names></name><name><surname>Wu</surname><given-names>Y</given-names></name></person-group><article-title>Network coding</article-title><source>Proc. IEEE</source><year>2011</year><volume>99</volume><fpage>363</fpage><lpage>365</lpage><pub-id pub-id-type="doi">10.1109/JPROC.2010.2096251</pub-id></citation></ref>
<ref id="b4-sensors-11-07908"><label>4.</label><citation citation-type="confproc"><person-group person-group-type="author"><name><surname>Gkantsidis</surname><given-names>C</given-names></name><name><surname>Miller</surname><given-names>J</given-names></name><name><surname>Rodriguez</surname><given-names>P</given-names></name></person-group><article-title>Comprehensive view of a live network coding P2P system</article-title><conf-name>Proceedings of the 6th ACM SIGCOMM Conference on Internet Measurement, IMC ’06</conf-name><conf-loc>Rio de Janeiro, Brazil</conf-loc><conf-date>October 2006</conf-date><fpage>177</fpage><lpage>188</lpage></citation></ref>
<ref id="b5-sensors-11-07908"><label>5.</label><citation citation-type="confproc"><person-group person-group-type="author"><name><surname>Widmer</surname><given-names>J</given-names></name><name><surname>Le Boudec</surname><given-names>JY</given-names></name></person-group><article-title>Network coding for efficient communication in extreme networks</article-title><conf-name>Proceedings of the 2005 ACM SIGCOMM Workshop on Delay-Tolerant Networking, WDTN ’05</conf-name><conf-loc>Philadelphia, PA, USA</conf-loc><conf-date>August 2005</conf-date><fpage>284</fpage><lpage>291</lpage></citation></ref>
<ref id="b6-sensors-11-07908"><label>6.</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Al-Kofahi</surname><given-names>OM</given-names></name><name><surname>Kamal</surname><given-names>AE</given-names></name></person-group><article-title>Network coding-based protection of many-to-one wireless flows</article-title><source>IEEE J. Sel. Areas Commun</source><year>2009</year><volume>27</volume><fpage>797</fpage><lpage>813</lpage><pub-id pub-id-type="doi">10.1109/JSAC.2009.090619</pub-id></citation></ref>
<ref id="b7-sensors-11-07908"><label>7.</label><citation citation-type="confproc"><person-group person-group-type="author"><name><surname>Woldegebreal</surname><given-names>DH</given-names></name><name><surname>Karl</surname><given-names>H</given-names></name></person-group><article-title>Network-coding-based cooperative transmission in wireless sensor networks: Diversity-multiplexing tradeoff and coverage area extension</article-title><conf-name>Proceedings of the 5th European Conference on Wireless Sensor Networks, EWSN’08</conf-name><conf-loc>Bologna, Italy</conf-loc><conf-date>30 January–1 February 2008</conf-date><fpage>141</fpage><lpage>155</lpage></citation></ref>
<ref id="b8-sensors-11-07908"><label>8.</label><citation citation-type="confproc"><person-group person-group-type="author"><name><surname>Platz</surname><given-names>D</given-names></name><name><surname>Woldegebreal</surname><given-names>DH</given-names></name><name><surname>Karl</surname><given-names>H</given-names></name></person-group><article-title>Random network coding in wireless sensor networks: Energy efficiency via cross-layer approach</article-title><conf-name>Proceedings of 2008 IEEE the 10th International Symposium on Spread Spectrum Techniques and Applications, ISSSTA ’08</conf-name><conf-loc>Bologna, Italy</conf-loc><conf-date>25–28 August 2008</conf-date><fpage>654</fpage><lpage>660</lpage></citation></ref>
<ref id="b9-sensors-11-07908"><label>9.</label><citation citation-type="confproc"><person-group person-group-type="author"><name><surname>Hou</surname><given-names>IH</given-names></name><name><surname>Tsai</surname><given-names>YE</given-names></name><name><surname>Abdelzaher</surname><given-names>T</given-names></name><name><surname>Gupta</surname><given-names>I</given-names></name></person-group><article-title>AdapCode: Adaptive network coding for code updates in wireless sensor networks</article-title><conf-name>Proceedings of the IEEE 27th Conference on Computer Communications, INFOCOM 2008</conf-name><conf-loc>Phoenix, AZ, USA</conf-loc><conf-date>13–18 April 2008</conf-date><fpage>1517</fpage><lpage>1525</lpage></citation></ref>
<ref id="b10-sensors-11-07908"><label>10.</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Egbogah</surname><given-names>EE</given-names></name><name><surname>Fapojuwo</surname><given-names>AO</given-names></name></person-group><article-title>A survey of system architecture requirements for health care-based wireless sensor networks</article-title><source>Sensors</source><year>2011</year><volume>11</volume><fpage>4875</fpage><lpage>4898</lpage><pub-id pub-id-type="doi">10.3390/s110504875</pub-id><pub-id pub-id-type="pmid">22163881</pub-id></citation></ref>
<ref id="b11-sensors-11-07908"><label>11.</label><citation citation-type="book"><person-group person-group-type="author"><name><surname>Wu</surname><given-names>Y</given-names></name><name><surname>Chou</surname><given-names>PA</given-names></name><name><surname>Kung</surname><given-names>SY</given-names></name></person-group><source>Information Exchange in Wireless Networks with Network Coding and Physical-Layer Broadcast</source><comment>Technical Report MSR-TR-2004-78;</comment><publisher-name>Microsoft Research</publisher-name><publisher-loc>Cambridge, UK</publisher-loc><year>2004</year></citation></ref>
<ref id="b12-sensors-11-07908"><label>12.</label><citation citation-type="confproc"><person-group person-group-type="author"><name><surname>Widmer</surname><given-names>J</given-names></name><name><surname>Fragouli</surname><given-names>C</given-names></name><name><surname>Le Boudec</surname><given-names>JY</given-names></name></person-group><article-title>Energy efficient broadcasting in wireless <italic>ad hoc</italic> networks</article-title><conf-name>Proceedings of First Workshop on Network Coding, Theory, and Applications, NetCod ’05</conf-name><conf-loc>Riva del Garda, Italy</conf-loc><conf-date>7 April 2005</conf-date></citation></ref>
<ref id="b13-sensors-11-07908"><label>13.</label><citation citation-type="confproc"><person-group person-group-type="author"><name><surname>Fragouli</surname><given-names>C</given-names></name><name><surname>Widmer</surname><given-names>J</given-names></name><name><surname>Le Boudec</surname><given-names>JY</given-names></name></person-group><article-title>A network coding approach to energy efficient broadcasting: From theory to practice</article-title><conf-name>Proceedings the 25th IEEE International Conference on Computer Communications, INFOCOM 2006</conf-name><conf-loc>Barcelona, Spain</conf-loc><conf-date>April 2006</conf-date></citation></ref>
<ref id="b14-sensors-11-07908"><label>14.</label><citation citation-type="confproc"><person-group person-group-type="author"><name><surname>Cai</surname><given-names>N</given-names></name><name><surname>Yeung</surname><given-names>R</given-names></name></person-group><article-title>Secure network coding</article-title><conf-name>Proceedings of 2002 IEEE International Symposium on Information Theory</conf-name><conf-loc>Lausanne, Switzerland</conf-loc><conf-date>30 June–5 July 2002</conf-date><fpage>323</fpage></citation></ref>
<ref id="b15-sensors-11-07908"><label>15.</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Cai</surname><given-names>N</given-names></name><name><surname>Yeung</surname><given-names>R</given-names></name></person-group><article-title>Secure network coding on a wiretap network</article-title><source>IEEE Trans. Inf. Theory</source><year>2011</year><volume>57</volume><fpage>424</fpage><lpage>435</lpage><pub-id pub-id-type="doi">10.1109/TIT.2010.2090197</pub-id></citation></ref>
<ref id="b16-sensors-11-07908"><label>16.</label><citation citation-type="confproc"><person-group person-group-type="author"><name><surname>Liu</surname><given-names>Z</given-names></name><name><surname>Wu</surname><given-names>C</given-names></name><name><surname>Li</surname><given-names>B</given-names></name><name><surname>Zhao</surname><given-names>S</given-names></name></person-group><article-title>UUSee: Large-Scale operational on-demand streaming with random network coding</article-title><conf-name>Proceedings of IEEE INFOCOM</conf-name><conf-loc>San Diego, CA, USA</conf-loc><conf-date>14–19 March 2010</conf-date><fpage>1</fpage><lpage>9</lpage></citation></ref>
<ref id="b17-sensors-11-07908"><label>17.</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Geer</surname><given-names>D</given-names></name></person-group><article-title>Chip makers turn to multicore processors</article-title><source>Computer</source><year>2005</year><volume>38</volume><fpage>11</fpage><lpage>13</lpage></citation></ref>
<ref id="b18-sensors-11-07908"><label>18.</label><citation citation-type="confproc"><person-group person-group-type="author"><name><surname>Ohara</surname><given-names>S</given-names></name><name><surname>Suzuki</surname><given-names>M</given-names></name><name><surname>Saruwatari</surname><given-names>S</given-names></name><name><surname>Morikawa</surname><given-names>H</given-names></name></person-group><article-title>A prototype of a multi-core wireless sensor node for reducing power consumption</article-title><conf-name>Proceedings of the 2008 International Symposium on Applications and the Internet</conf-name><conf-loc>Washington, DC, USA</conf-loc><conf-date>28 July–1 August 2008</conf-date><fpage>369</fpage><lpage>372</lpage></citation></ref>
<ref id="b19-sensors-11-07908"><label>19.</label><citation citation-type="confproc"><person-group person-group-type="author"><name><surname>Spies</surname><given-names>C</given-names></name><name><surname>Indrusiak</surname><given-names>L</given-names></name><name><surname>Glesner</surname><given-names>M</given-names></name></person-group><article-title>Comparative analysis of multitask scheduling algorithms for reconfigurable computing regarding context switches and configuration cache usage</article-title><conf-name>Proceedings of the 3rd Southern Conference on Programmable Logic, SPL ’07</conf-name><conf-loc>Mar del Plata, Argentina</conf-loc><conf-date>26–28 February 2007</conf-date><fpage>239</fpage><lpage>242</lpage></citation></ref>
<ref id="b20-sensors-11-07908"><label>20.</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Akyildiz</surname><given-names>IF</given-names></name><name><surname>Melodia</surname><given-names>T</given-names></name><name><surname>Chowdhury</surname><given-names>KR</given-names></name></person-group><article-title>Wireless multimedia sensor Networks: Applications and testbeds</article-title><source>Proc. IEEE</source><year>2008</year><volume>96</volume><fpage>1588</fpage><lpage>1605</lpage><pub-id pub-id-type="doi">10.1109/JPROC.2008.928756</pub-id></citation></ref>
<ref id="b21-sensors-11-07908"><label>21.</label><citation citation-type="book"><person-group person-group-type="author"><name><surname>Bharadwaj</surname><given-names>V</given-names></name><name><surname>Robertazzi</surname><given-names>TG</given-names></name><name><surname>Ghose</surname><given-names>D</given-names></name></person-group><source>Scheduling Divisible Loads in Parallel and Distributed Systems</source><publisher-name>IEEE Computer Society Press</publisher-name><publisher-loc>Los Alamitos, CA, USA</publisher-loc><year>1996</year></citation></ref>
<ref id="b22-sensors-11-07908"><label>22.</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Bharadwaj</surname><given-names>V</given-names></name><name><surname>Ghose</surname><given-names>D</given-names></name></person-group><article-title>Divisible load theory: A new paradigm for load scheduling in distributed systems</article-title><source>Clust. Comput</source><year>2003</year><volume>6</volume><fpage>7</fpage><lpage>17</lpage><pub-id pub-id-type="doi">10.1023/A:1020958815308</pub-id></citation></ref>
<ref id="b23-sensors-11-07908"><label>23.</label><citation citation-type="book"><person-group person-group-type="author"><name><surname>Drozdowski</surname><given-names>M</given-names></name></person-group><source>Selected Problems of Scheduling Tasks in Multiprocessor Computer Systems</source><comment>Technical Report 321;</comment><publisher-name>Politechnika Poznańska</publisher-name><publisher-loc>Poznań, Poland</publisher-loc><year>1997</year></citation></ref>
<ref id="b24-sensors-11-07908"><label>24.</label><citation citation-type="web"><article-title>Intel Microprocessor Export Compliance Metrics</article-title><comment>Available online: <ext-link xlink:href="http://www.intel.com/support/processors/sb/cs-023143.htm" ext-link-type="uri">http://www.intel.com/support/processors/sb/cs-023143.htm</ext-link> (accessed on 3 July 2011).</comment></citation></ref>
<ref id="b25-sensors-11-07908"><label>25.</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Kahle</surname><given-names>JA</given-names></name><name><surname>Day</surname><given-names>MN</given-names></name><name><surname>Hofstee</surname><given-names>HP</given-names></name><name><surname>Johns</surname><given-names>CR</given-names></name><name><surname>Maeurer</surname><given-names>TR</given-names></name><name><surname>Shippy</surname><given-names>D</given-names></name></person-group><article-title>Introduction to the cell multiprocessor</article-title><source>IBM J. Res. Dev</source><year>2005</year><volume>49</volume><fpage>589</fpage><lpage>604</lpage><pub-id pub-id-type="doi">10.1147/rd.494.0589</pub-id></citation></ref>
<ref id="b26-sensors-11-07908"><label>26.</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Pham</surname><given-names>D</given-names></name><name><surname>Aipperspach</surname><given-names>T</given-names></name><name><surname>Boerstler</surname><given-names>D</given-names></name><name><surname>Bolliger</surname><given-names>M</given-names></name><name><surname>Chaudhry</surname><given-names>R</given-names></name><name><surname>Cox</surname><given-names>D</given-names></name><name><surname>Harvey</surname><given-names>P</given-names></name><name><surname>Harvey</surname><given-names>P</given-names></name><name><surname>Hofstee</surname><given-names>H</given-names></name><name><surname>Johns</surname><given-names>C</given-names></name><name><surname>Kahle</surname><given-names>J</given-names></name><name><surname>Kameyama</surname><given-names>A</given-names></name><name><surname>Keaty</surname><given-names>J</given-names></name><name><surname>Masubuchi</surname><given-names>Y</given-names></name><name><surname>Pham</surname><given-names>M</given-names></name><name><surname>Pille</surname><given-names>J</given-names></name><name><surname>Posluszny</surname><given-names>S</given-names></name><name><surname>Riley</surname><given-names>M</given-names></name><name><surname>Stasiak</surname><given-names>D</given-names></name><name><surname>Suzuoki</surname><given-names>M</given-names></name><name><surname>Takahashi</surname><given-names>O</given-names></name><name><surname>Warnock</surname><given-names>J</given-names></name><name><surname>Weitzel</surname><given-names>S</given-names></name><name><surname>Wendel</surname><given-names>D</given-names></name><name><surname>Yazawa</surname><given-names>K</given-names></name></person-group><article-title>Overview of the architecture, circuit design, and physical implementation of a first-generation cell processor</article-title><source>IEEE J. Solid-State Circuits</source><year>2006</year><volume>41</volume><fpage>179</fpage><lpage>196</lpage><pub-id pub-id-type="doi">10.1109/JSSC.2005.859896</pub-id></citation></ref>
<ref id="b27-sensors-11-07908"><label>27.</label><citation citation-type="confproc"><person-group person-group-type="author"><name><surname>Chou</surname><given-names>PA</given-names></name><name><surname>Wu</surname><given-names>Y</given-names></name><name><surname>Jain</surname><given-names>K</given-names></name></person-group><article-title>Practical network coding</article-title><conf-name>Proceedings of Allerton Conference on Communication, Control, and Computing</conf-name><conf-loc>Monticello, IL, USA</conf-loc><conf-date>20 October 2003</conf-date></citation></ref>
<ref id="b28-sensors-11-07908"><label>28.</label><citation citation-type="confproc"><person-group person-group-type="author"><name><surname>Shojania</surname><given-names>H</given-names></name><name><surname>Li</surname><given-names>B</given-names></name></person-group><article-title>Parallelized progressive network coding with hardware acceleration</article-title><conf-name>Proceedings of 2007 the 15th IEEE International Workshop on Quality of Service</conf-name><conf-loc>Evanston, IL, USA</conf-loc><conf-date>21–22 June 2007</conf-date><fpage>47</fpage><lpage>55</lpage></citation></ref>
<ref id="b29-sensors-11-07908"><label>29.</label><citation citation-type="confproc"><person-group person-group-type="author"><name><surname>Bisseling</surname><given-names>RH</given-names></name><name><surname>van de Vorst</surname><given-names>JGG</given-names></name></person-group><article-title>Parallel LU decomposition on a transputer network</article-title><conf-name>Proceedings of the Shell Conference on Parallel Computing</conf-name><conf-loc>Amsterdam, The Netherlands</conf-loc><conf-date>1–2 June 1988</conf-date><fpage>61</fpage><lpage>77</lpage></citation></ref>
<ref id="b30-sensors-11-07908"><label>30.</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Melab</surname><given-names>N</given-names></name><name><surname>Talbi</surname><given-names>EG</given-names></name><name><surname>Petiton</surname><given-names>S</given-names></name></person-group><article-title>A parallel adaptive Gauss-Jordan algorithm</article-title><source>J. Supercomput</source><year>2000</year><volume>17</volume><fpage>167</fpage><lpage>185</lpage><pub-id pub-id-type="doi">10.1023/A:1008182404262</pub-id></citation></ref>
<ref id="b31-sensors-11-07908"><label>31.</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Park</surname><given-names>K</given-names></name><name><surname>Park</surname><given-names>JS</given-names></name><name><surname>Ro</surname><given-names>WW</given-names></name></person-group><article-title>On improving parallelized network coding with dynamic partitioning</article-title><source>IEEE Trans. Parallel Distrib. Syst</source><year>2010</year><volume>21</volume><fpage>1547</fpage><lpage>1560</lpage><pub-id pub-id-type="doi">10.1109/TPDS.2010.40</pub-id></citation></ref>
<ref id="b32-sensors-11-07908"><label>32.</label><citation citation-type="book"><person-group person-group-type="author"><name><surname>Arevalo</surname><given-names>A</given-names></name><name><surname>Matinata</surname><given-names>RM</given-names></name><name><surname>Pandian</surname><given-names>MR</given-names></name><name><surname>Peri</surname><given-names>E</given-names></name><name><surname>Ruby</surname><given-names>K</given-names></name><name><surname>Thomas</surname><given-names>F</given-names></name><name><surname>Almond</surname><given-names>C</given-names></name></person-group><source>Programming the Cell Broadband Engine Architecture: Examples and Best Practices</source><publisher-name>Vervante</publisher-name><publisher-loc>Springville, UT, USA</publisher-loc><year>2008</year></citation></ref>
<ref id="b33-sensors-11-07908"><label>33.</label><citation citation-type="web"><article-title>PPU &amp; SPU C/C++ Language Extension Specification</article-title><comment>Available online: <ext-link xlink:href="https://www-01.ibm.com/chips/techlib/techlib.nsf/techdocs/30B3520C93F437AB87257060006FFE5E" ext-link-type="uri">https://www-01.ibm.com/chips/techlib/techlib.nsf/techdocs/30B3520C93F437AB87257060006FFE5E</ext-link> (accessed on 3 July 2011).</comment></citation></ref>
<ref id="b34-sensors-11-07908"><label>34.</label><citation citation-type="confproc"><person-group person-group-type="author"><name><surname>Shojania</surname><given-names>H</given-names></name><name><surname>Li</surname><given-names>B</given-names></name><name><surname>Wang</surname><given-names>X</given-names></name></person-group><article-title>Nuclei: GPU-Accelerated many-core network coding</article-title><conf-name>Proceedings of IEEE INFOCOM 2009</conf-name><conf-loc>Rio de Janeiro, Brazil</conf-loc><conf-date>19–25 April 2009</conf-date><fpage>459</fpage><lpage>467</lpage></citation></ref>
<ref id="b35-sensors-11-07908"><label>35.</label><citation citation-type="web"><article-title>AltiVec Technology Programming Interface Manual</article-title><comment>Available online: <ext-link xlink:href="http://www.freescale.com/files/32bit/doc/refmanual/ALTIVECPIM.pdf" ext-link-type="uri">http://www.freescale.com/files/32bit/doc/refmanual/ALTIVECPIM.pdf</ext-link> (accessed on 3 July 2011).</comment></citation></ref>
<ref id="b36-sensors-11-07908"><label>36.</label><citation citation-type="book"><source>Intel(R) 64 and IA-32 Architectures Optimization Reference Manual</source><publisher-name>Intel Corporation</publisher-name><publisher-loc>Santa Clara, CA, USA</publisher-loc><year>2010</year></citation></ref>
<ref id="b37-sensors-11-07908"><label>37.</label><citation citation-type="book"><person-group person-group-type="author"><name><surname>Roth</surname><given-names>R</given-names></name></person-group><source>Introduction to Coding Theory</source><edition>1st ed</edition><publisher-name>Cambridge University Press</publisher-name><publisher-loc>Cambridge, UK</publisher-loc><year>2006</year></citation></ref>
<ref id="b38-sensors-11-07908"><label>38.</label><citation citation-type="web"><person-group person-group-type="author"><name><surname>Trenholme</surname><given-names>S</given-names></name></person-group><article-title>AES’ Galois field</article-title><comment>Available online: <ext-link xlink:href="http://www.samiam.org/galois.html" ext-link-type="uri">http://www.samiam.org/galois.html</ext-link> (accessed on 3 July 2011).</comment></citation></ref>
<ref id="b39-sensors-11-07908"><label>39.</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Koetter</surname><given-names>R</given-names></name><name><surname>Médard</surname><given-names>M</given-names></name></person-group><article-title>An algebraic approach to network coding</article-title><source>IEEE/ACM Trans. Netw</source><year>2003</year><volume>11</volume><fpage>782</fpage><lpage>795</lpage><pub-id pub-id-type="doi">10.1109/TNET.2003.818197</pub-id></citation></ref>
<ref id="b40-sensors-11-07908"><label>40.</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Ho</surname><given-names>T</given-names></name><name><surname>Medard</surname><given-names>M</given-names></name><name><surname>Koetter</surname><given-names>R</given-names></name><name><surname>Karger</surname><given-names>D</given-names></name><name><surname>Effros</surname><given-names>M</given-names></name><name><surname>Shi</surname><given-names>J</given-names></name><name><surname>Leong</surname><given-names>B</given-names></name></person-group><article-title>A random linear network coding approach to multicast</article-title><source>IEEE Trans. Inf. Theory</source><year>2006</year><volume>52</volume><fpage>4413</fpage><lpage>4430</lpage><pub-id pub-id-type="doi">10.1109/TIT.2006.881746</pub-id></citation></ref>
<ref id="b41-sensors-11-07908"><label>41.</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Lun</surname><given-names>D</given-names></name><name><surname>Ratnakar</surname><given-names>N</given-names></name><name><surname>Medard</surname><given-names>M</given-names></name><name><surname>Koetter</surname><given-names>R</given-names></name><name><surname>Karger</surname><given-names>D</given-names></name><name><surname>Ho</surname><given-names>T</given-names></name><name><surname>Ahmed</surname><given-names>E</given-names></name><name><surname>Zhao</surname><given-names>F</given-names></name></person-group><article-title>Minimum-cost multicast over coded packet networks</article-title><source>IEEE Trans. Inf. Theory</source><year>2006</year><volume>52</volume><fpage>2608</fpage><lpage>2623</lpage><pub-id pub-id-type="doi">10.1109/TIT.2006.874523</pub-id></citation></ref>
<ref id="b42-sensors-11-07908"><label>42.</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Katti</surname><given-names>S</given-names></name><name><surname>Rahul</surname><given-names>H</given-names></name><name><surname>Hu</surname><given-names>W</given-names></name><name><surname>Katabi</surname><given-names>D</given-names></name><name><surname>Medard</surname><given-names>M</given-names></name><name><surname>Crowcroft</surname><given-names>J</given-names></name></person-group><article-title>XORs in the air: Practical wireless network coding</article-title><source>IEEE/ACM Trans. Netw</source><year>2008</year><volume>16</volume><fpage>497</fpage><lpage>510</lpage><pub-id pub-id-type="doi">10.1109/TNET.2008.923722</pub-id></citation></ref>
<ref id="b43-sensors-11-07908"><label>43.</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Park</surname><given-names>JS</given-names></name><name><surname>Gerla</surname><given-names>M</given-names></name><name><surname>Lun</surname><given-names>D</given-names></name><name><surname>Yi</surname><given-names>Y</given-names></name><name><surname>Medard</surname><given-names>M</given-names></name></person-group><article-title>Codecast: A network-coding-based <italic>ad hoc</italic> multicast protocol</article-title><source>IEEE Wirel. Commun</source><year>2006</year><volume>13</volume><fpage>76</fpage><lpage>81</lpage><pub-id pub-id-type="doi">10.1109/WC-M.2006.250362</pub-id></citation></ref>
<ref id="b44-sensors-11-07908"><label>44.</label><citation citation-type="confproc"><person-group person-group-type="author"><name><surname>Gkantsidis</surname><given-names>C</given-names></name><name><surname>Rodriguez</surname><given-names>P</given-names></name></person-group><article-title>Network coding for large scale content distribution</article-title><conf-name>Proceedings of the 24th Annual Joint Conference of the IEEE Computer and Communications Societies, INFOCOM 2005</conf-name><conf-loc>Miami, FL, USA</conf-loc><conf-date>13–17 March 2005</conf-date><volume>4</volume><fpage>2235</fpage><lpage>2245</lpage></citation></ref>
<ref id="b45-sensors-11-07908"><label>45.</label><citation citation-type="confproc"><person-group person-group-type="author"><name><surname>Wang</surname><given-names>M</given-names></name><name><surname>Li</surname><given-names>B</given-names></name></person-group><article-title>Lava: A reality check of network coding in peer-to-peer live streaming</article-title><conf-name>Proceedings of the 26th IEEE International Conference on Computer Communications, INFOCOM 2007</conf-name><conf-loc>Anchorage, AK, USA</conf-loc><conf-date>6–12 May 2007</conf-date><fpage>1082</fpage><lpage>1090</lpage></citation></ref>
<ref id="b46-sensors-11-07908"><label>46.</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Kang</surname><given-names>M</given-names></name><name><surname>Kang</surname><given-names>DI</given-names></name><name><surname>Crago</surname><given-names>SP</given-names></name><name><surname>Park</surname><given-names>GL</given-names></name><name><surname>Lee</surname><given-names>J</given-names></name></person-group><article-title>Design and development of a run-time monitor for multi-core architectures in cloud computing</article-title><source>Sensors</source><year>2011</year><volume>11</volume><fpage>3595</fpage><lpage>3610</lpage><pub-id pub-id-type="doi">10.3390/s110403595</pub-id><pub-id pub-id-type="pmid">22163811</pub-id></citation></ref>
<ref id="b47-sensors-11-07908"><label>47.</label><citation citation-type="confproc"><person-group person-group-type="author"><name><surname>Lee</surname><given-names>U</given-names></name><name><surname>Park</surname><given-names>JS</given-names></name><name><surname>Yeh</surname><given-names>J</given-names></name><name><surname>Pau</surname><given-names>G</given-names></name><name><surname>Gerla</surname><given-names>M</given-names></name></person-group><article-title>Code torrent: Content distribution using network coding in VANET</article-title><conf-name>Proceedings of the 1st International Workshop on Decentralized Resource Sharing in Mobile Computing and Networking, MobiShare ’06</conf-name><conf-loc>Los Angeles, CA, USA</conf-loc><conf-date>September 2006</conf-date><fpage>1</fpage><lpage>5</lpage></citation></ref>
<ref id="b48-sensors-11-07908"><label>48.</label><citation citation-type="confproc"><person-group person-group-type="author"><name><surname>Shojania</surname><given-names>H</given-names></name><name><surname>Li</surname><given-names>B</given-names></name></person-group><article-title>Random network coding on the iPhone: Fact or fiction?</article-title><conf-name>Proceedings of the 18th International Workshop on Network and Operating Systems Support for Digital Audio and Video, NOSSDAV ’09</conf-name><conf-loc>Braunschweig, Germany</conf-loc><conf-date>May 2009</conf-date><fpage>37</fpage><lpage>42</lpage></citation></ref>
<ref id="b49-sensors-11-07908"><label>49.</label><citation citation-type="confproc"><person-group person-group-type="author"><name><surname>Lee</surname><given-names>U</given-names></name><name><surname>Park</surname><given-names>JS</given-names></name><name><surname>Yeh</surname><given-names>J</given-names></name><name><surname>Pau</surname><given-names>G</given-names></name><name><surname>Gerla</surname><given-names>M</given-names></name></person-group><article-title>A content distribution system based on sparse linear network coding</article-title><conf-name>Proceedings of the 3rd Workshop on Network Coding, Theory, and Applications, NetCod ’07</conf-name><conf-loc>San Diego, CA, USA</conf-loc><conf-date>January 2007</conf-date><fpage>1</fpage><lpage>6</lpage></citation></ref>
<ref id="b50-sensors-11-07908"><label>50.</label><citation citation-type="confproc"><person-group person-group-type="author"><name><surname>Maymounkov</surname><given-names>P</given-names></name><name><surname>Harvey</surname><given-names>NJA</given-names></name></person-group><article-title>Methods for efficient network coding</article-title><conf-name>Proceedings of the 44th Annual Allerton Conference on Communication, Control, and Computing</conf-name><conf-loc>Urbana, IL, USA</conf-loc><conf-date>27–29 September 2006</conf-date></citation></ref>
<ref id="b51-sensors-11-07908"><label>51.</label><citation citation-type="confproc"><person-group person-group-type="author"><name><surname>Park</surname><given-names>K</given-names></name><name><surname>Park</surname><given-names>JS</given-names></name><name><surname>Ro</surname><given-names>WW</given-names></name></person-group><article-title>Efficient parallelized network coding for P2P file sharing applications</article-title><conf-name>Proceedings of the 4th International Conference on Advances in Grid and Pervasive Computing, GPC ’09</conf-name><conf-loc>Geneva, Switzerland</conf-loc><conf-date>4–8 May 2009</conf-date><fpage>353</fpage><lpage>363</lpage></citation></ref>
<ref id="b52-sensors-11-07908"><label>52.</label><citation citation-type="confproc"><person-group person-group-type="author"><name><surname>Csanky</surname><given-names>L</given-names></name></person-group><article-title>Fast parallel matrix inversion algorithms</article-title><conf-name>Proceedings of the 16th Annual Symposium on Foundations of Computer Science, SFCS ’75</conf-name><conf-loc>Berkeley, CA, USA</conf-loc><conf-date>13–15 October 1975</conf-date><fpage>11</fpage><lpage>12</lpage></citation></ref>
<ref id="b53-sensors-11-07908"><label>53.</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Drozdowski</surname><given-names>M</given-names></name><name><surname>Lawenda</surname><given-names>M</given-names></name></person-group><article-title>Multi-installment divisible load processing in heterogeneous distributed systems</article-title><source>Concurr. Comput. Pract. Exp</source><year>2007</year><volume>19</volume><fpage>2237</fpage><lpage>2253</lpage><pub-id pub-id-type="doi">10.1002/cpe.1180</pub-id></citation></ref>
<ref id="b54-sensors-11-07908"><label>54.</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Cariño</surname><given-names>RL</given-names></name><name><surname>Banicescu</surname><given-names>I</given-names></name></person-group><article-title>Dynamic load balancing with adaptive factoring methods in scientific applications</article-title><source>J. Supercomput</source><year>2008</year><volume>44</volume><fpage>41</fpage><lpage>63</lpage><pub-id pub-id-type="doi">10.1007/s11227-007-0148-y</pub-id></citation></ref>
<ref id="b55-sensors-11-07908"><label>55.</label><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Chu</surname><given-names>X</given-names></name><name><surname>Zhao</surname><given-names>K</given-names></name><name><surname>Wang</surname><given-names>M</given-names></name></person-group><article-title>Accelerating network coding on many-core GPUs and multi-core CPUs</article-title><source>J. Commun</source><year>2009</year><volume>4</volume><fpage>902</fpage><lpage>909</lpage></citation></ref></ref-list>
<sec sec-type="display-objects">
<title>Figures and Tables</title>
<fig id="f1-sensors-11-07908" position="float">
<label>Figure 1.</label>
<caption>
<p>The block diagram of the Cell BE architecture.</p></caption>
<graphic xlink:href="sensors-11-07908f1.gif"/></fig>
<fig id="f2-sensors-11-07908" position="float">
<label>Figure 2.</label>
<caption>
<p>Advantage of using network coding.</p></caption>
<graphic xlink:href="sensors-11-07908f2.gif"/></fig>
<fig id="f3-sensors-11-07908" position="float">
<label>Figure 3.</label>
<caption>
<p>Data encoding at the sending node.</p></caption>
<graphic xlink:href="sensors-11-07908f3.gif"/></fig>
<fig id="f4-sensors-11-07908" position="float">
<label>Figure 4.</label>
<caption>
<p>Data received at the receiving node.</p></caption>
<graphic xlink:href="sensors-11-07908f4.gif"/></fig>
<fig id="f5-sensors-11-07908" position="float">
<label>Figure 5.</label>
<caption>
<p>Processes from Stage A to Stage E; (<bold>a</bold>) During Stage A operation; (<bold>b</bold>) After Stage D operation; and (<bold>c</bold>) After Stage E operation.</p></caption>
<graphic xlink:href="sensors-11-07908f5.gif"/></fig>
<fig id="f6-sensors-11-07908" position="float">
<label>Figure 6.</label>
<caption>
<p>Parallelization algorithms for network coding on a homogeneous processor; (<bold>a</bold>) HP; (<bold>b</bold>) RRP; and (<bold>c</bold>) DVP.</p></caption>
<graphic xlink:href="sensors-11-07908f6.gif"/></fig>
<fig id="f7-sensors-11-07908" position="float">
<label>Figure 7.</label>
<caption>
<p>Dynamic resource distribution to Cell BE.</p></caption>
<graphic xlink:href="sensors-11-07908f7.gif"/></fig>
<fig id="f8-sensors-11-07908" position="float">
<label>Figure 8.</label>
<caption>
<p>Optimized loop-based multiplication in GF(2<sup>8</sup>) for GPU.</p></caption>
<graphic xlink:href="sensors-11-07908f8.gif"/></fig>
<fig id="f9-sensors-11-07908" position="float">
<label>Figure 9.</label>
<caption>
<p>The loop-based SIMD multiplication in GF(2<sup>8</sup>).</p></caption>
<graphic xlink:href="sensors-11-07908f9.gif"/></fig>
<fig id="f10-sensors-11-07908" position="float">
<label>Figure 10.</label>
<caption>
<p>Decoding time of HP, RRP, and DVP on the Cell BE with various coefficient matrix sizes; (<bold>a</bold>) 64 × 64; (<bold>b</bold>) 128 × 128; (<bold>c</bold>) 256 × 256; and (<bold>d</bold>) 512 × 512.</p></caption>
<graphic xlink:href="sensors-11-07908f10.gif"/></fig>
<fig id="f11-sensors-11-07908" position="float">
<label>Figure 11.</label>
<caption>
<p>Speed-up of Galois Field operation.</p></caption>
<graphic xlink:href="sensors-11-07908f11.gif"/></fig>
<fig id="f12-sensors-11-07908" position="float">
<label>Figure 12.</label>
<caption>
<p>Speed-up in decoding time compared with COMPUTE for a 128 × 128 coefficient matrix; (<bold>a</bold>) PPE; and (<bold>b</bold>) SPE.</p></caption>
<graphic xlink:href="sensors-11-07908f12.gif"/></fig>
<fig id="f13-sensors-11-07908" position="float">
<label>Figure 13.</label>
<caption>
<p>Inbound mailbox synchronization.</p></caption>
<graphic xlink:href="sensors-11-07908f13.gif"/></fig>
<fig id="f14-sensors-11-07908" position="float">
<label>Figure 14.</label>
<caption>
<p>Decoding time of the three algorithms that use the PPE, compared with using only the SPEs, for a 512 × 512 coefficient matrix; (<bold>a</bold>) <italic>PPE_COMPUTE</italic>; (<bold>b</bold>) <italic>PPE_TL</italic>; and (<bold>c</bold>) <italic>PPE_VECTOR</italic>.</p></caption>
<graphic xlink:href="sensors-11-07908f14.gif"/></fig>
<fig id="f15-sensors-11-07908" position="float">
<label>Figure 15.</label>
<caption>
<p>Speed-up with various ppefactor values; (<bold>a</bold>) <italic>PPE_COMPUTE</italic>; (<bold>b</bold>) <italic>PPE_TL</italic>; and (<bold>c</bold>) <italic>PPE_VECTOR</italic>.</p></caption>
<graphic xlink:href="sensors-11-07908f15.gif"/></fig>
<fig id="f16-sensors-11-07908" position="float">
<label>Figure 16.</label>
<caption>
<p>Speed-up of the algorithms relative to the result with a factor of “1” for varying coefficient matrix sizes; (<bold>a</bold>) 64 × 64; (<bold>b</bold>) 128 × 128; (<bold>c</bold>) 256 × 256; and (<bold>d</bold>) 512 × 512.</p></caption>
<graphic xlink:href="sensors-11-07908f16.gif"/></fig>
<fig id="f17-sensors-11-07908" position="float">
<label>Figure 17.</label>
<caption>
<p>Decoding time on real machines with varying coefficient matrix sizes; (<bold>a</bold>) 64 × 64; (<bold>b</bold>) 128 × 128; (<bold>c</bold>) 256 × 256; and (<bold>d</bold>) 512 × 512.</p></caption>
<graphic xlink:href="sensors-11-07908f17.gif"/></fig>
<fig id="f18-sensors-11-07908" position="float">
<label>Figure 18.</label>
<caption>
<p>Average speed-up of network coding on real machines with varying data sizes; (<bold>a</bold>) Intel; and (<bold>b</bold>) AMD.</p></caption>
<graphic xlink:href="sensors-11-07908f18.gif"/></fig>
<fig id="f19-sensors-11-07908" position="float">
<label>Figure 19.</label>
<caption>
<p>Speed-up of <italic>PPE_TL</italic> over <italic>PPE_VECTOR</italic> with varying data size.</p></caption>
<graphic xlink:href="sensors-11-07908f19.gif"/></fig>
<table-wrap id="t1-sensors-11-07908" position="float">
<label>Table 1.</label>
<caption>
<p>Five Stages of Progressive Decoding [<xref ref-type="bibr" rid="b28-sensors-11-07908">28</xref>].</p></caption>
<table frame="hsides" rules="groups">
<thead>
<tr>
<th align="center" valign="middle"><bold>Stage</bold></th>
<th colspan="2" align="center" valign="middle"><bold>Procedure Description and Workload Distribution</bold></th></tr></thead>
<tbody>
<tr>
<td align="center" valign="middle"><bold>A</bold></td>
<td align="left" valign="middle">Using the previous coefficient rows, reduce the leading coefficients in the new row to zero.</td>
<td align="right" valign="middle"><bold>(50.05%)</bold></td></tr>
<tr>
<td align="center" valign="middle"><bold>B</bold></td>
<td align="left" valign="middle">Find the first non-zero coefficient in the new coefficient row.</td>
<td align="right" valign="middle"><bold>(0.05%)</bold></td></tr>
<tr>
<td align="center" valign="middle"><bold>C</bold></td>
<td align="left" valign="middle">Check for linear independence with existing coefficient rows.</td>
<td align="right" valign="middle"><bold>(0.00001%)</bold></td></tr>
<tr>
<td align="center" valign="middle"><bold>D</bold></td>
<td align="left" valign="middle">Reduce the leading non-zero entry of the new row to 1.</td>
<td align="right" valign="middle"><bold>(0.38%)</bold></td></tr>
<tr>
<td align="center" valign="middle"><bold>E</bold></td>
<td align="left" valign="middle">Reduce the coefficient matrix to the reduced row-echelon form.</td>
<td align="right" valign="middle"><bold>(49.5%)</bold></td></tr></tbody></table></table-wrap>
<table-wrap id="t2-sensors-11-07908" position="float">
<label>Table 2.</label>
<caption>
<p>Experimental Environments.</p></caption>
<table frame="hsides" rules="groups">
<thead>
<tr>
<th colspan="2" align="center" valign="middle"/>
<th align="center" valign="middle"><bold>Sony PlayStation3</bold></th>
<th align="center" valign="middle"><bold>Intel Quad Core</bold></th>
<th align="center" valign="middle"><bold>AMD Quad Core</bold></th></tr></thead>
<tbody>
<tr>
<td align="center" valign="middle"/>
<td align="center" valign="middle"><bold>CPU</bold></td>
<td align="center" valign="middle">Cell BE</td>
<td align="center" valign="middle">Intel Core 2 Quad Q9400</td>
<td align="center" valign="middle">AMD Phenom X4 9550</td></tr>
<tr>
<td align="center" valign="middle"/>
<td align="center" valign="middle"><bold>Clock</bold></td>
<td align="center" valign="middle">3.2 GHz</td>
<td align="center" valign="middle">2.66 GHz</td>
<td align="center" valign="middle">2.2 GHz</td></tr>
<tr>
<td align="center" valign="middle"/>
<td align="center" valign="middle"><bold>RAM</bold></td>
<td align="center" valign="middle">512 MB</td>
<td align="center" valign="middle">2 GB</td>
<td align="center" valign="middle">4 GB</td></tr>
<tr>
<td align="center" valign="middle"><bold>SPEC</bold></td>
<td align="center" valign="middle" rowspan="3"><bold>Cache Size</bold></td>
<td align="center" valign="middle">L1: 32 KB</td>
<td align="center" valign="middle">L1: 4 × 64 KB</td>
<td align="center" valign="middle">L1: 4 × 128 KB</td></tr>
<tr>
<td align="center" valign="middle"/>
<td align="center" valign="middle">L2: 512 KB</td>
<td align="center" valign="middle">L2: 2 × 3 MB</td>
<td align="center" valign="middle">L2: 4 × 512 KB</td></tr>
<tr>
<td align="center" valign="middle"/>
<td align="center" valign="middle"/>
<td align="center" valign="middle"/>
<td align="center" valign="middle">L3: 2 MB shared</td></tr>
<tr>
<td align="center" valign="middle" rowspan="2"/>
<td align="center" valign="middle" rowspan="2"><bold>OS</bold></td>
<td align="center" valign="middle">Linux</td>
<td align="center" valign="middle">Linux</td>
<td align="center" valign="middle">Linux</td></tr>
<tr>
<td align="center" valign="middle">Yellow Dog Linux 6.1</td>
<td align="center" valign="middle">Fedora Core 7</td>
<td align="center" valign="middle">Fedora Core 8</td></tr>
<tr>
<td align="center" valign="middle"/>
<td align="center" valign="middle"><bold>Number of Cores</bold></td>
<td align="center" valign="middle">(1 + 6)</td>
<td align="center" valign="middle">4</td>
<td align="center" valign="middle">4</td></tr></tbody></table></table-wrap>
<table-wrap id="t3-sensors-11-07908" position="float">
<label>Table 3.</label>
<caption>
<p>Speed-up compared with Equally Distributed Decoding.</p></caption>
<table frame="hsides" rules="groups">
<thead>
<tr>
<th align="center" valign="middle"/>
<th align="center" valign="middle"><bold>COMPUTE</bold></th>
<th align="center" valign="middle"><bold>TL</bold></th>
<th align="center" valign="middle"><bold>VECTOR</bold></th></tr></thead>
<tbody>
<tr>
<td align="center" valign="middle"><bold>Optimal factor</bold></td>
<td align="center" valign="middle">0.32</td>
<td align="center" valign="middle">0.88</td>
<td align="center" valign="middle">2.38</td></tr>
<tr>
<td align="center" valign="middle"><bold>Maximum speed-up</bold></td>
<td align="center" valign="middle">2.15</td>
<td align="center" valign="middle">1.42</td>
<td align="center" valign="middle">1.26</td></tr>
<tr>
<td align="center" valign="middle"><bold>Average speed-up</bold></td>
<td align="center" valign="middle">1.59</td>
<td align="center" valign="middle">1.03</td>
<td align="center" valign="middle">1.08</td></tr></tbody></table></table-wrap>
<table-wrap id="t4-sensors-11-07908" position="float">
<label>Table 4.</label>
<caption>
<p>Comparison of Homogeneous Processors.</p></caption>
<table frame="hsides" rules="groups">
<thead>
<tr>
<th align="center" valign="middle"/>
<th align="center" valign="middle"/>
<th align="center" valign="middle"><bold>COMPUTE</bold></th>
<th align="center" valign="middle"><bold>TL</bold></th>
<th align="center" valign="middle"><bold>VECTOR</bold></th></tr></thead>
<tbody>
<tr>
<td align="center" valign="middle" rowspan="2"><bold>Intel</bold></td>
<td align="center" valign="middle"><bold>Maximum speed-up</bold></td>
<td align="center" valign="middle">1.80</td>
<td align="center" valign="middle">1.90</td>
<td align="center" valign="middle">2.19</td></tr>
<tr>
<td align="center" valign="middle"><bold>Average speed-up</bold></td>
<td align="center" valign="middle">1.05</td>
<td align="center" valign="middle">1.27</td>
<td align="center" valign="middle">1.36</td></tr>
<tr>
<td align="center" valign="middle" rowspan="2"><bold>AMD</bold></td>
<td align="center" valign="middle"><bold>Maximum speed-up</bold></td>
<td align="center" valign="middle">2.71</td>
<td align="center" valign="middle">3.00</td>
<td align="center" valign="middle">3.12</td></tr>
<tr>
<td align="center" valign="middle"><bold>Average speed-up</bold></td>
<td align="center" valign="middle">1.77</td>
<td align="center" valign="middle">2.19</td>
<td align="center" valign="middle">2.31</td></tr></tbody></table></table-wrap></sec></back></article>
