This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution license (http://creativecommons.org/licenses/by/3.0/).

While network coding is well known for its efficiency and usefulness in wireless sensor networks, the excessive cost and complexity of decoding still hinder its adoption in practice. At the same time, high-performance microprocessors with heterogeneous multi-cores are expected to serve as processing nodes of wireless sensor networks in the near future. To this end, this paper introduces an efficient network coding algorithm developed for heterogeneous multi-core processors. The proposed idea is fully tested on one of the currently available heterogeneous multi-core processors, the Cell Broadband Engine.

Network coding is a new coding technique first proposed by Ahlswede

In fact, the application of network coding techniques to various real-world problems has been introduced [

While network coding has several advantages and is a promising technique for future network systems, one crucial drawback is the associated computational overhead, which may hinder its adoption in practice. Network coding requires encoding the data before it is sent and decoding it after it is received. However, the decoding algorithm has O(n^{3}) computational complexity, using a variant of Gaussian elimination, where n denotes the number of blocks.

On the other hand, multi-core processors have recently become widespread and can be found in a variety of systems [

In this paper, we present a parallel network coding algorithm for heterogeneous multi-core processors, especially targeting the use of the technique in WSNs. We select an already available heterogeneous platform, the Cell BE, as a prototype of heterogeneous multi-core processors and adjust the workload distribution on each core for efficient network coding. The Cell BE is a heterogeneous multi-core processor designed to provide both generality and intensive computing power with the single instruction multiple data (SIMD) paradigm. Therefore, the design of the Cell BE lends itself well to the adoption of SIMD, which can be efficiently utilized in wireless multimedia sensor networks [

In fact, using the Cell BE processor in sensor nodes may not be desirable due to its size and power consumption. However, the main goal of this paper is to present efficient parallel algorithms for network coding on heterogeneous processors and to demonstrate the possible advantages and feasibility of the approach. To achieve this, we formulate an appropriate load balancing method based on the concept of divisible load theory (DLT), which was initially introduced by Bharadwaj

Via real machine experiments, we demonstrate that the proposed technique delivers improvements in decoding speed. With proper load balancing, we achieve a maximum speed-up of 2.15, compared to the performance results without load balancing. In addition, we compare our idea to the results obtained in two homogenous multi-core processors which provide competitive computing power. Compared to the Intel quad-core system, our approach achieves a maximum speed increase of 2.19, with 1 MB of data and a coefficient matrix of size 64 × 64. When we compare our performance to that of an AMD processor, we observe a maximum speed-up of 3.12 for 128 KB of data and a coefficient matrix of size 64 × 64.

The rest of this paper is organized as follows. We describe network coding theory and give a brief overview of the Cell BE architecture in Section 2. Then, in Section 3, we propose parallelized network coding implementations for the Cell BE, as well as an extension to the SIMD instruction set. In Section 4, experimental performance results are presented and analyzed. In Section 5, related work is discussed. Finally, we conclude the paper in Section 6.

In this section, we first present an overview of the Cell BE and then introduce the necessary background on the concept of network coding.

The Cell Broadband Engine (Cell BE) is a heterogeneous multiprocessor developed by Sony, IBM, and Toshiba in 2000. Although a long time has passed since its first release, the Cell BE delivers 256 GFLOPS (giga floating-point operations per second) and still provides good performance as a single-chip processor compared to one of today's high-performance commercial processors, the Intel Core i7 series (the Intel Core i7 975 has a theoretical performance of 221.44 GFLOPS) [

The PPE is a dual-threaded, dual-issue, in-order 64-bit Power-architecture processor. It has a 32 KB instruction cache and a 32 KB data cache, as well as a 512 KB L2 cache. In addition, the PPE has an

The SPEs are composed of a Synergistic Processor Unit (SPU), a 256 KB local store, and a Memory Flow Controller (MFC). The execution performance of the SPEs dominates the overall computational performance of the Cell BE. The SPU contains a 128-bit-wide dual-issue SIMD unit, fully pipelined at all precisions with the exception of the double-precision vector unit. An SPE accesses main storage through effective address (EA) translation by the MFC and asynchronously transfers data to its local store, supporting both narrow (128 bits) and wide (128 bytes) transfers.

The Element Interconnect Bus (EIB) is a coherent bus that can transfer up to 96 bytes per processor cycle. It consists of four 16-byte rings, each capable only of unidirectional data transfer, clockwise or counter-clockwise, and each ring supports up to three simultaneous data transfers. The Cell BE employs dual-channel

We will introduce the principles and advantages of using network coding in this subsection.

Each directed edge represents a pathway for information transfer. Node

Let us assume that we have generated data bits

At the edge spanning nodes

When using network coding, we are able to generate new data by first encoding

To fully leverage the potential benefits of the network coding technique in a practical system, the encoding and decoding operations must be fast enough (

A given segment of data, such as a single file, is divided into a specific number of blocks b_{1}, ..., b_{n}. The i-th encoded packet x_{i} is generated as a linear combination of the blocks, using a randomly chosen coefficient vector c_{i} = (c_{i,1}, ..., c_{i,n}).
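As an illustration, the encoding step can be sketched in Python. The shift-and-reduce GF(2^{8}) multiplication and the field polynomial 0x11B are assumptions (the paper does not state which polynomial its implementation uses), and `gf_mul` and `encode` are hypothetical helper names:

```python
import random

GF_POLY = 0x11B  # x^8 + x^4 + x^3 + x + 1; a common choice, assumed here

def gf_mul(a, b):
    """Shift-and-reduce multiplication in GF(2^8)."""
    p = 0
    for _ in range(8):
        if b & 1:
            p ^= a
        b >>= 1
        a <<= 1
        if a & 0x100:
            a ^= GF_POLY
    return p

def encode(blocks, coeffs):
    """Produce one coded packet x_i = sum_j c_{i,j} * b_j, byte-wise over GF(2^8)."""
    out = bytearray(len(blocks[0]))
    for c, block in zip(coeffs, blocks):
        for k, byte in enumerate(block):
            out[k] ^= gf_mul(c, byte)  # addition in GF(2^8) is XOR
    return bytes(out)

# e.g. two 4-byte blocks combined with random coefficients
blocks = [b"\x01\x02\x03\x04", b"\x10\x20\x30\x40"]
coeffs = [random.randrange(256) for _ in blocks]
packet = encode(blocks, coeffs)
```

Choosing a unit coefficient vector returns the corresponding original block unchanged, which is a convenient sanity check for an implementation.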

While the packets are being routed, they are re-encoded within nodes along the pathways to their destinations before being passed to downstream nodes. When a packet arrives at its destination node, it is stored in local memory so the coded data can be decoded and recovered to the original data set [. The received packets form the vector x = (x_{1}, ..., x_{n})^{T}. To decode the encoded data, the destination node must have n linearly independent coefficient vectors, so that the original blocks can be recovered as b = C^{−1}x, where C is the coefficient matrix whose rows are the vectors c_{i}.

Using a variant of

Let n denote the number of blocks. Inverting the n × n coefficient matrix has O(n^{3}) computational complexity. However, in the decoding process associated with network coding, there is an extra multiplication of the inverted matrix with the received data x_{i}, which adds O(n^{2}) operations per coded block.

An additional peer within a file swarming system can reduce download delay; however, the decoding complexity, which grows as n^{3}, offsets the reduction in download delay, and thus the benefit is canceled out. Therefore, in order to achieve some measure of benefit from a large n, the decoding computation must be accelerated.

A variety of decoding methods that employ the random linear network coding technique are based on matrix inversion algorithms [

The traditional matrix inversion algorithms require a complete matrix to perform the decoding operation; this results in additional delays due to the waiting period. In contrast, progressive decoding requires only one row of the matrix to proceed with decoding. As such, progressive decoding is more suitable to network environments that are subject to long transmission delays.

The decoding process for traditional matrix inversion algorithms has a computational complexity of O(n^{3}), incurred after the last row has arrived. With progressive decoding, however, we can initiate the decoding process as each row is received. Since computation on all prior rows has already finished, the most recent row can be processed with complexity O(n^{2}). In our evaluation, we employ progressive decoding to implement parallel decoding algorithms on the Cell BE.
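A minimal sketch of such a progressive decoder, which reduces each coefficient row as it arrives via Gauss-Jordan elimination over GF(2^{8}), might look as follows. The class and helper names are hypothetical, the field polynomial 0x11B is an assumption, and the brute-force inverse is for illustration only:

```python
GF_POLY = 0x11B  # assumed GF(2^8) polynomial

def gf_mul(a, b):
    p = 0
    for _ in range(8):
        if b & 1:
            p ^= a
        b >>= 1
        a <<= 1
        if a & 0x100:
            a ^= GF_POLY
    return p

def gf_inv(a):
    # brute-force multiplicative inverse; fine for a sketch
    return next(b for b in range(1, 256) if gf_mul(a, b) == 1)

class ProgressiveDecoder:
    """Row-by-row elimination: each arriving packet costs O(n^2)
    instead of waiting for the full matrix and paying O(n^3) at once."""
    def __init__(self, n):
        self.n = n
        self.rows = {}  # pivot column -> (coefficient row, payload)

    def receive(self, coeffs, data):
        coeffs, data = list(coeffs), list(data)
        # eliminate already-known pivots from the new row
        for p, (pc, pd) in self.rows.items():
            f = coeffs[p]
            if f:
                coeffs = [c ^ gf_mul(f, x) for c, x in zip(coeffs, pc)]
                data = [d ^ gf_mul(f, x) for d, x in zip(data, pd)]
        # find the leading non-zero entry; none means a dependent row
        try:
            p = next(i for i, c in enumerate(coeffs) if c)
        except StopIteration:
            return False
        # normalize the leading entry to 1
        inv = gf_inv(coeffs[p])
        coeffs = [gf_mul(inv, c) for c in coeffs]
        data = [gf_mul(inv, d) for d in data]
        # back-substitute into the rows stored so far
        for q, (qc, qd) in self.rows.items():
            f = qc[p]
            if f:
                self.rows[q] = ([c ^ gf_mul(f, x) for c, x in zip(qc, coeffs)],
                                [d ^ gf_mul(f, x) for d, x in zip(qd, data)])
        self.rows[p] = (coeffs, data)
        return True

    def recovered(self):
        return [self.rows[i][1] for i in range(self.n)] if len(self.rows) == self.n else None
```

Linearly dependent rows reduce to all zeros and are discarded, which mirrors the independence check of the staged algorithm described later in the paper.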

Shojania and Li were the first to demonstrate the effectiveness of parallelization in network coding with their


After the operations of

In the network coding research conducted previously by Shojania and Li [

In the previous work [

For an efficient decoding operation, we first distribute the computational region as shown in

The mailbox system is provided for each SPE and implemented in an asymmetric manner; both the

Synchronization can be achieved using the outbound mailbox in the following manner. Each SPE writes a mail to its outbound entry and continuously checks whether the PPE has read the mail, emptying the outbound entry. Upon receiving mails from all the SPEs, synchronization is achieved. This also implies that the PPE is responsible for waiting until it receives all the mails. After synchronization is guaranteed, each SPE waits for a reply containing a pivot element from the PPE.

On the other hand, we also propose using the inbound mailbox solely for synchronization, which provides better performance with a simpler implementation. The PPE transfers the pivot element to the inbound mailbox entry, and the SPE continuously checks until the pivot element is completely transferred. In this way, we eliminate the need for synchronization messages from the SPE side. This is possible because any stalled read on an empty inbound entry can serve the synchronization purpose.
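The effect of using a blocking read on the inbound mailbox as the synchronization point can be illustrated with an ordinary thread-and-queue analogy. This is not the Cell SDK API; `inbound` is a hypothetical stand-in for the per-SPE inbound mailbox entry:

```python
import threading
import queue

NUM_SPES = 6
# each SPE's inbound mailbox entry modeled as a one-slot queue
inbound = [queue.Queue(maxsize=1) for _ in range(NUM_SPES)]
results = [None] * NUM_SPES

def spe_worker(idx):
    # A read on an empty inbound entry stalls, so the read itself
    # doubles as the synchronization point; no extra message is needed.
    pivot = inbound[idx].get()
    results[idx] = pivot

threads = [threading.Thread(target=spe_worker, args=(i,)) for i in range(NUM_SPES)]
for t in threads:
    t.start()
for q in inbound:
    q.put(42)  # the "PPE" broadcasts the pivot element to every SPE
for t in threads:
    t.join()
```

The design choice mirrors the text: combining the data transfer (the pivot element) with the synchronization event removes one round of messages per decoding step.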

In this subsection, we propose our approach which enables an optimized workload distribution on the Cell BE.

For that purpose, we first have defined a value called

Once the workload distribution over the PPE and SPEs is defined, the data partitioning for the SIMD instructions should be defined. Although the data computation region is dynamically partitioned by DVP, further architectural optimization is possible because the Cell BE supports a SIMD instruction set. The SIMD instructions for the PPE and SPEs enable 128-bit operations; for that reason, the data are divided into chunks of 16 bytes each. When the size of the data is not a multiple of 16, the remainder is assigned to the PPE. For example, when the data size is 117 bytes, chunks are constructed starting from the rightmost element of the data (the rightmost column), which yields 7 chunks and a 5-byte remainder at the leftmost positions. The 5 remaining bytes are then assigned to one of the PPE threads, and the 7 chunks are assigned to the other PPE thread and the SPEs. This method is superior to one in which each core simply receives an equal number of elements.
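The chunking rule above amounts to simple arithmetic; `partition_chunks` below is a hypothetical helper that reproduces the 117-byte example (7 SIMD chunks plus a 5-byte scalar remainder for a PPE thread):

```python
def partition_chunks(data_len, simd_width=16):
    """Split a row of data_len bytes into 16-byte SIMD chunks plus a
    scalar remainder that is assigned to a PPE thread."""
    n_chunks, remainder = divmod(data_len, simd_width)
    return n_chunks, remainder

# partition_chunks(117) -> (7, 5), matching the example in the text
```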

After addressing the workload distribution on each thread, we need to select a proper computation method between the table-based approach and the loop-based approach for Galois field multiplications. We now explain these two approaches in Section 3.3 and the selected method is then fully tested in Section 4.2.

Random linear network coding operates on Galois field numbers and incurs computational overhead due to the time-consuming multiplication operations. In this subsection, we propose an optimization technique for the Galois field operation that was previously proposed for the GPU [

The granularity of the Galois field multiplication is hard to increase when using a table look-up method [
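For reference, the table look-up method reduces each GF(2^{8}) product to one addition and one look-up via discrete logarithms. The sketch below assumes the polynomial 0x11B with generator 3 (the paper does not specify its tables), and `mul_table` is a hypothetical name:

```python
# Build exp/log tables once; 3 (i.e., x + 1) generates GF(2^8)* for 0x11B.
EXP, LOG = [0] * 510, [0] * 256
v = 1
for i in range(255):
    EXP[i] = EXP[i + 255] = v  # duplicate so index sums up to 508 need no modulo
    LOG[v] = i
    nv = v << 1                # multiply by x ...
    if nv & 0x100:
        nv ^= 0x11B            # ... reducing modulo the field polynomial
    v ^= nv                    # times (x + 1): next power of the generator

def mul_table(a, b):
    """Table look-up multiplication: one integer add, one look-up."""
    if a == 0 or b == 0:
        return 0
    return EXP[LOG[a] + LOG[b]]
```

The limitation mentioned above is visible here: each product still handles a single byte, so widening the operation requires either much larger tables or a different (loop-based) formulation.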

To provide sufficient granularity of the multiplications, Shojania

In the previous work [, a loop-based multiplication was proposed that computes products directly in GF(2^{8}).

They successfully applied the loop-based multiplication to the multiple scalar processors of a GPU, as depicted in

Although they highly optimized the loop-based multiplication method by reducing the diversity of control flow on branch instructions, one more branch instruction can still be removed. Since a branch instruction causes stalls within the pipeline, any branch in the loop crucially degrades the performance of the Galois field multiplication; in fact, the speed of the Galois field multiplication strongly affects the performance of network coding. We therefore target the removal of the remaining branch within the loop, represented as (3) in
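In scalar form, this branch elimination amounts to replacing each in-loop conditional with an arithmetic mask, as in the hedged Python sketch below (the actual implementation uses SIMD select instructions, which this only approximates; the function name is hypothetical):

```python
def gf_mul_branchless(a, b):
    """GF(2^8) multiply with both in-loop conditionals replaced by masks,
    so the loop body contains no data-dependent branch. Assumes the
    polynomial 0x11B (low byte 0x1B after the shifted-out bit is dropped)."""
    p = 0
    for _ in range(8):
        p ^= -(b & 1) & a                      # add a only when the LSB of b is set
        mask = -((a >> 7) & 1)                 # all-ones when the high bit of a is set
        a = ((a << 1) & 0xFF) ^ (mask & 0x1B)  # shift, then conditionally reduce
        b >>= 1
    return p
```

On SIMD hardware the two masks map directly to compare-and-select operations, which is what removes the pipeline stalls discussed above.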

The proposed Galois field multiplication based on the SIMD instruction set is shown in

An SPE calculates 80 KB within 211

In this section, we first evaluate, on the Cell BE, the previous parallelized network coding algorithm developed for homogeneous multi-core processors; we simply translate the previous approach to the SIMD instruction set of the Cell BE. Then, we compare the multiplication methods: table-based, loop-based, and SIMD-instruction-set multiplication. Further, we compare the parallelized decoding performance when applying the specific multiplication methods to the PPE and SPEs. We also evaluate partitioning of the PPE workload, applying the three multiplication methods adaptively, using

We evaluate the previously proposed algorithms for homogeneous multi-core processors, HP, RRP, and DVP, on the Cell BE architecture. These algorithms are first implemented using only the SPEs and the SIMD instruction set of the SPE.

The maximum performance difference between HP and RRP is only 1.69%, and the average difference is only 0.04%. In addition, DVP shows a maximum 31% enhancement over HP and RRP. Therefore, in the next section, we perform the remaining experiments using DVP. As on a homogeneous multi-core processor, the load-balancing advantage of DVP brings better results. A detailed explanation of DVP can be found in [

The difference between Horizontal Partitioning (HP) and Row-by-Row Partitioning (RRP) comes from the different manner in which rows are distributed to the different cores in

In this subsection, we evaluate the decoding performance of each Galois field multiplication method. For the analysis, we choose to use the 128 bit SIMD instruction set to parallelize the Galois field multiplications.

Let

In particular, the

In

Despite the low performance of the SPEs on small data sizes, the SPEs show similar speed-ups when the data size becomes large. Since the SPEs must be controlled by the PPE, which synchronizes the decoding process between SPEs and transfers data from main memory, the SPEs show lower performance on small data sizes, where synchronization and data-transfer overhead account for a large proportion of the total time.

From the results in

In Section 3.1, we introduce an efficient way to implement synchronization with the asymmetric mailbox system. With the inbound mailbox, the cores can synchronize at each decoding step and share values in the pivot column at once. We have compared the decoding speed of the two synchronization methods, based on the inbound mailbox and the outbound mailbox respectively, in

In the experimental results, the synchronization method that combines synchronization with the data transfer reduces decoding time by more than 10%. COMPUTE and TL show remarkably reduced results, since these methods already suffer severe synchronization overhead from an unfair workload distribution that does not consider the different computation capabilities of the core types. Consequently, synchronization with the inbound mailbox system reduces both the performance degradation caused by inefficient synchronization and that caused by the absence of a proper workload distribution. The performance improvement from a well-balanced workload is tested in the next subsection.

In Section 3.2, we explained the different factors that must be considered in determining the workload distribution for the PPE and the SPEs. We have examined three multiplication methods on the PPE and compared the result of each method to the performance achieved by utilizing only the SPEs. The performance results are depicted in

In this section, we propose an approach to the factorization of workload between cores, and we evaluate the decoding time while varying the distribution factor, which we refer to as the ppefactor.

In order to parallelize the progressive decoding algorithm across multiple cores, it is necessary to have a synchronization barrier that blocks excessive progression by any one particular thread. Barrier synchronization greatly decreases performance when load balancing results in an uneven distribution between cores. Thus, we evaluate the sensitivity of the three algorithms with respect to

In

We compare the performance results of our factorized parallelization to the results obtained using a

In

Ahlswede

The applications of network coding have been proposed in [

In addition, Lee

Parallelized network coding was first suggested by Shojania and Li [

Many algorithms have been proposed to parallelize matrix calculation, such as the parallelization of matrix inversion [

Approaches to enhancing the performance of the progressive decoding were proposed in

The load-balancing problem has been emphasized in divisible load theory [

In this paper, we introduced an efficient random linear network coding algorithm with an appropriate load balancing method for a heterogeneous multi-core processor. We especially designed the proposed architecture with the wireless sensor network environment in mind. Our algorithm introduced a proper load balancing method and a hybrid progressive decoding algorithm that considers the different computing capabilities of the cores. We achieve a maximum speed increase by selectively using multiplication algorithms that are (1) table-based for small coefficient and data sizes and (2) parallelized with SIMD instructions for large coefficient sizes, as shown in

We compared the performance of the proposed approach to one of the fastest progressive decoding algorithms, executed on homogeneous processors. This comparison demonstrated improved performance with our method.

This work was supported by the Korea Research Foundation Grant funded by the Korean Government (KRF-2008-313-D00871).

The block diagram of the Cell BE architecture.

Advantage of using network coding.

Data encoding at the sending node.

Data received at the receiving node.

Processes on Stage A to Stage E; (

Parallelization algorithms of network coding on Homogeneous processor; (

Dynamic resource distribution to Cell BE.

Optimized loop-based multiplication of GF(2^{8}) for GPU.

The loop-based SIMD multiplication in GF(2^{8}).

Decoding time of HP, RRP, and DVP on the Cell BE with various coefficient matrix size; (a) 64 × 64; (b) 128 × 128; (c) 256 × 256; and (d) 512 × 512.

Speed-up of Galois Field operation.

Speed-up of decoding time compared with COMPUTE on 128 × 128 coefficient matrix size; (

Inbound mailbox synchronization.

Decoding time of three algorithms which using PPE compared with only using SPEs with coefficient matrix size of 512; (

Speed-up with various ppefactor; (

Speed-up of the algorithms compared with the result of having factor “1” when varying coefficient matrix size; (

Decoding time on real machine with varying coefficient matrix size; (

Average speed-up of network coding on real machine with varying data size; (a) Intel; and (b) AMD.

Speed-up of

Five Stages of Progressive Decoding [

Stage A | Using the previous coefficient rows, reduce the leading coefficients in the new row to zero.
Stage B | Find the first non-zero coefficient in the new coefficient row.
Stage C | Check for linear independence with existing coefficient rows.
Stage D | Reduce the leading non-zero entry of the new row to 1.
Stage E | Reduce the coefficient matrix to the reduced row-echelon form.

Experimental Environments.

Processor | Cell BE | Intel Core 2 Quad Q9400 | AMD Phenom X4 9550
Clock speed | 3.2 GHz | 2.66 GHz | 2.2 GHz
Memory | 512 MB | 2 GB | 4 GB
L1 cache | 32 KB | 4 × 64 KB | 4 × 128 KB
L2 cache | 512 KB | 2 × 3 MB | 4 × 512 KB
L3 cache | — | — | 2 MB shared
OS | Linux | Linux | Linux
Distribution | Yellow Dog Linux 6.1 | Fedora Core 7 | Fedora Core 8
Number of cores | (1 + 6) | 4 | 4

Speed-up compared to Equally Distributed Decoding.

0.32 | 0.88 | 2.38 | |

2.15 | 1.42 | 1.26 | |

1.59 | 1.03 | 1.08 |

Comparison of Homogeneous Processors.

1.80 | 1.90 | 2.19 | ||

1.05 | 1.27 | 1.36 | ||

2.71 | 3.00 | 3.12 | ||

1.77 | 2.19 | 2.31 |