SoC-Based Edge Computing Gateway in the Context of the Internet of Multimedia Things: Experimental Platform

Abstract: This paper presents an algorithm/architecture and hardware/software (HW/SW) co-design for implementing a digital edge computing layer on a Zynq platform in the context of the Internet of Multimedia Things (IoMT). Traditional cloud computing is no longer suitable for applications that require image processing, due to cloud latency and privacy concerns. With edge computing, data are processed, analyzed, and encrypted very close to the device, which makes it possible to secure data and act rapidly on connected things. The proposed edge computing system is composed of a reconfigurable module that simultaneously compresses and encrypts multiple images, along with wireless image transmission and display functionalities. A lightweight implementation of the proposed design is obtained by approximate computing of the discrete cosine transform (DCT) and by using a simple chaotic generator, which greatly enhances the encryption efficiency. The deployed solution includes four configurations based on HW/SW partitioning in order to handle the trade-off between execution time, area, and energy consumption. The experimental setup showed that, by moving more components to hardware execution, a timing speedup of more than nine times could be achieved with a negligible amount of energy consumption. The power efficiency was thereby improved by a factor of 7.7.


Context
Today, the Internet of Things (IoT) is no longer a strategic initiative. It is a reality for major industry, government, and consumer sectors. With an estimated 15 billion connected objects and an annual growth of 17%, the IoT is rapidly becoming part of our everyday lives. The most basic architecture of the IoT is composed of three layers: perception, network, and application. With this architecture, captured data are transmitted to the cloud to be analyzed in order to create added value. It is in this way that more than 90% of big data has been generated since 2011, with more than 2/3 of big data being multimedia [1]. However, the cloud cannot support and analyze the continuously increasing quantity of data. Cloud computing is no longer suitable for applications that require video analytics, due to the long data transmission latency and privacy concerns. A new paradigm, often referred to as edge computing (also fog computing or near-sensor computing), has emerged over the past few years to overcome these problems. Shi et al. define edge computing as enabling technologies "allowing computation to be performed at the edge of the network, on downstream data on behalf of cloud services and upstream data on behalf of IoT services" [2]. The goal of edge computing is to process data very close to the device, and to send only meaningful information to higher layers. According to the International Data Corporation (IDC) [3], it is expected that by 2019 at least 40% of IoT-created data will be stored, processed, analyzed, and acted upon close to, or at the edge of, the network.
The Internet of Multimedia Things (IoMT) can be considered as a class of the IoT. The IoMT is composed of smart things equipped with the capability to observe and/or interact with the physical environment [4]. Today, for smart city applications, it is difficult to imagine an IoT architecture without multimedia things. Contrary to classical IoT devices, where sensors acquire and transmit information at low data rates, the use of multimedia things implies new challenges, such as high data rates, noise resilience, and power efficiency. Hence, edge computing becomes more interesting due to the huge quantity of data generated by the video sensors. By extending the cloud to where the things are (pre-processing data very close to the camera), edge computing enables us to avoid loading the network with useless data, to act rapidly on connected things (without cloud latency), and to secure things [5]. The European Telecommunications Standards Institute (ETSI) has established many use cases for mobile edge computing. One interesting use case concerns video analysis, where the cameras send high-bandwidth video streams to a particular node (gateway) positioned at the edge of the network. The latter performs video analysis and sends low-bandwidth streams to the network core [6], as depicted in Figure 1.

Related Work
Mobile edge computing has several benefits compared to the traditional cloud-based computing paradigm [7]. It has been found that, for running face recognition applications, the response time is reduced from 900 to 169 ms by moving computation from the cloud to the edge [8]. Moreover, for some other applications, such as wearable cognitive assistance, the response time is reduced and the energy consumption could also be reduced by 30% to 40% [9]. Chun et al. [10] have addressed cloud offloading in terms of the energy/performance trade-off. Their prototype could reduce the running time and energy of the tested applications by 20×. In terms of application use cases, behavioral analytics based on a multimedia network is one of the most common applications, where the behavioral analysis (for example, detecting a vehicle traveling in the wrong direction) is carried out in the camera node [7]. Also, in telemedicine, new IoMT frameworks have been developed, as in [11], to empower patients via health promotion, assessing reductions in health deficits and improvements in quality of life.
However, design constraints related to the perception layer become stronger and should be taken into account when designing IoT applications in the context of edge computing. Two main issues should be considered when designing IoMT applications. The first one concerns the algorithmic level. Indeed, image processing algorithms are often sophisticated and could be simplified. At this level, a selection of algorithms with a lightweight implementation is mandatory. Then, to further reduce the algorithmic computational complexity, the use of approximate computing, as in [12,13], can be beneficial. The second issue concerns the architectural design. For high-performance edge computing systems, a field programmable gate array (FPGA) can be used as a key component that implements customized hardware logic and performs image processing tasks in real time. Latest-generation FPGAs provide small, low-cost, and low-power devices that are ideal for edge computing applications.
Most of the state-of-the-art methods developed recently are dedicated exclusively to the software or hardware implementation of a single node, often called a smart camera. Regarding the software implementation, in [14], an autonomous smart INCA+ camera has been designed on a software processor for face detection. Another software implementation of smart cameras is described in [15], where a real-time object tracking system is developed to transmit only high-level information. In [16,17], a GPU implementation of a face recognition scheme using spectral correlation has been developed, and a speedup of 4× was obtained compared to an implementation based on a generic processor. On the other hand, the hardware implementation is often based on an FPGA. Today, FPGAs are fabricated as system-on-chip (SoC) platforms containing hardware and software sections on the same chip. The hardware part is used when high speed and parallel computation are required, while the software is employed to add more flexibility and connectivity to the designed system. The use of platforms based on an SoC architecture has gained particular attention [18–21]. However, the existing platforms in [18,19] do not use the FPGA as the hardware accelerator of computationally complex algorithms, and those of [18–21] do not consider HW/SW co-design to evaluate the energy/quality of the designed systems. In [22], a smart camera node was designed in order to detect falls by elderly persons in daily life. The designed prototype uses an HW/SW implementation for fall detection at a clock speed comparable to that of real-time systems. However, this implementation is dedicated to a single node, and no communication functions were planned.

Objective of the Work
The objective of this paper is to build a proof-of-concept platform of edge computing for the IoMT based on HW/SW co-design. The platform is composed of a camera sensor, a flexible SoC for image processing, and a communication function between nodes. Each IoMT node can act as a receiver or an emitter by using a dedicated protocol. The goal of this work is to show how algorithm/architecture co-design can be useful for IoMT applications. For algorithmic issues, we started from the theory (approximate computing, chaos generation, fusion and diffusion schemes, and so on) and finally obtained an experimental platform that can be used as a gateway in the context of the IoMT. At the architectural level, we used the Zynq design flow to shorten the design time, and we used the FPGA as the hardware accelerator. We designed a digital gateway that can be used at the edge of the network in order to compress and encrypt multiple images.
We take the simultaneous compression and encryption of multiple images as the use case. Compression is necessary for all image applications that require data transmission, as in the IoMT, while image encryption has become more and more important in the last few years. The proposed platform could be considered the cornerstone of IoMT applications, including video surveillance, face recognition, pedestrian detection, etc. Hence, this paper presents some details of an open and experimental platform that can be shared by the community in order to design image and video processing applications. The proposed platform contains the following processing steps: image acquisition (here, images are stored in memories), compression, encryption, and transmission. Through algorithmic and architectural optimizations, we relaxed the constraints related to time and energy consumption while maintaining an acceptable quality of experience. At the algorithmic level, we considered the approximate computing of the DCT used in the compression, and we employed simple chaotic generators for image encryption. The architectural optimization, in turn, is provided by the hardware and software partitioning of the whole system. Four configurations of the system partitioning were considered so that the time/energy trade-off can correspond to a very broad range of applications. Finally, we should note that the presented design is flexible and can include other treatments such as face recognition, motion detection, and identification.
The rest of the paper is organized as follows: in Section 2, we detail the algorithmic considerations for IoMT applications. Section 3 is devoted to the hardware considerations. The HW/SW partitioning is analyzed in Section 4, where four configurations are illustrated. The experimental setup and the performance analysis are given in Section 5, before the conclusion.

Need for Image Compression and Encryption
Today, multimedia content is rich and varied, with the introduction of new formats such as 4K, 8K, high dynamic range (HDR), wide color gamut (WCG), and high frame rate (HFR), and with the emergence of new applications such as virtual reality, 360-degree video, etc. Coding, decoding, and downloading/uploading multimedia content are the most commonly used functionalities for any IoMT application. The low-power realization of these functions in mobile and portable things is therefore important for a reasonably good battery life. The total power consumption to run these applications comprises two major parts: (1) the power consumed in compression (or decompression), which is related to the video codec complexity, and (2) the power consumed in transmitting or downloading images, which is a function of the bit rate and depends on the compression efficiency of the codec. When considering the IoMT, the power consumed in compression and decompression is becoming increasingly dominant, since the transmission range is short. Therefore, there is a need to compress multimedia content with a relatively simple algorithm. Moreover, there is a need to protect multimedia content.
A viable solution to simultaneously compress and encrypt multiple images has been presented in [23]. The method is based on the discrete cosine transform (DCT) and a specific spectral filtering. A set of n images is considered, as illustrated in Figure 2. Each image is transformed into the spectral domain by the DCT, then filtered in order to keep a few low-frequency components. The latter are fused in a random manner to constitute a compressed and ciphered image. An improved version of the original work has been proposed in [24], where the size of the spectral block is adapted to the target image. The size adaptation is performed by applying the root-mean-square (RMS) time-frequency criterion. The straightforward application of this method in the context of edge computing suffers from two major drawbacks. The first concerns the high computational time of the compression process and the secret key calculations. The second is related to the low resistance against statistical and differential attacks, since the encryption scheme consists of shuffling spectral components of the compressed images without any substitution.

Approximate Computing for Compression Improvement
The original method achieves compression by applying the DCT as many times as necessary to compress multiple images. Since the DCT is computationally intensive and its complexity grows non-linearly with the image size, in this paper we propose to use an approximation of the DCT, as in [12,13]. The main objective of the DCT approximation is to get rid of the multiplications, which consume most of the power and computation time, while still obtaining a meaningful estimation of the DCT. Indeed, the approximate DCT consists of replacing the coefficients (twiddle factors) of the basis vectors with elements of {−1, 0, 1}. Hence, the multiplication operations used for the exact DCT computation are replaced by additions and subtractions, which have a lower latency and consume less energy and silicon area than multiplications. It has been demonstrated that the approximate DCT hugely decreases the computation requirements of the DCT and only slightly decreases the quality of the reconstructed images. More specifically, the 16-point 1-D approximate DCT requires 64 adders and eight shift operators, while the exact DCT calculation requires 86 multipliers, 100 adders, and eight shifters. In [13], it was found that the energy per output coefficient is reduced by a factor of about 16 with the approximate DCT compared to the exact one, and that the gain in terms of area-delay product is around 13. Moreover, the quality of experience is only slightly degraded by the DCT approximation, since the PSNR of the reconstructed images with approximation is decreased by less than 0.8 dB when the compression ratio is lower than 1:2. Finally, to be compliant with the method proposed in [23], each N × N image is transformed and then low-pass filtered to get an N' × N' compressed image. In this work, we considered four input images and N' = N/4, which implies a compression ratio of 16:1 per image. It is worth noting that the DCT approximation simplifies the compression process in terms of computational requirements, but has no effect on the size of the compressed bitstream.
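To make the multiplier-free idea concrete, the following minimal sketch builds a ternary stand-in for the 16-point DCT-II basis by thresholding the cosine basis, and applies it with additions and subtractions only. This is an illustration under stated assumptions, not the approximation of [12,13]: the thresholding rule, the `threshold` value, and the function names are hypothetical choices made for this example.

```python
import math

def ternary_dct_matrix(n=16, threshold=0.2):
    """Ternary stand-in for the n-point DCT-II basis (entries in {-1, 0, 1}).

    Assumption: entries are obtained by thresholding the cosine basis.
    This is NOT the matrix of [12,13]; it only illustrates the principle.
    """
    rows = []
    for k in range(n):
        row = []
        for i in range(n):
            c = math.cos((2 * i + 1) * k * math.pi / (2 * n))
            row.append(0 if abs(c) < threshold else (1 if c > 0 else -1))
        rows.append(row)
    return rows

def apply_ternary(matrix, x):
    """Transform the vector x using additions and subtractions only."""
    out = []
    for row in matrix:
        acc = 0
        for t, v in zip(row, x):
            if t == 1:
                acc += v
            elif t == -1:
                acc -= v
        out.append(acc)
    return out

T = ternary_dct_matrix()
# A constant block has no high-frequency content: only the DC output is non-zero.
flat = apply_ternary(T, [3] * 16)
```

Because every entry is −1, 0, or 1, each output coefficient is a signed sum of input samples, which is the property that lets a hardware implementation drop the multipliers.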

Encryption Improvement
Traditional DES-, AES-, and RSA-based encryption methods are poorly suited to image encryption, due to the computation speed of modern computers and to intrinsic properties of images such as pixel correlation and voluminous file size. Chaotic generators have benefited from the growing need to secure data and, at the same time, have demonstrated a great potential for image encryption. The robustness of chaos-based cryptosystems is due to the intrinsic properties of chaotic generators, such as sensitivity to initial conditions and control parameters, non-periodicity, deterministic behavior, and a large key space [25]. With these features, chaos-based cryptosystems have excellent properties of diffusion and confusion, which are mandatory in cryptography. More particularly, one-dimensional non-linear iterative functions are capable of showing chaotic behavior under some conditions related to the tuning parameter and the initial conditions. Moreover, their lower computational complexity has facilitated their adoption in many image encryption schemes based on several chaotic maps, such as Matthews [25], logistic [26], Henon [27], Skew Tent [28], Arnold [29], and Baker [30].
The flow chart of the encryption scheme used in this work was inspired by that of [26], where an improvement of the existing method in [24] is provided by using a software-based implementation. Extensive numerical simulations have been reported in [31] to prove the robustness of the proposed scheme against brute force, statistical, differential, and chosen plaintext attacks. Figure 3 depicts the encryption scheme, where a and s0 are the initial conditions of the chaotic generators, which are developed in the next section. We should finally note that the approximate DCTs are from [12,13], the method of multiple image compression and encryption is from [23], and the chaotic generators are taken from the literature. Taken separately, none of these building blocks is novel. In this work, we have used a set of algorithmic improvements simultaneously in order to alleviate the hardware constraints of the overall scheme. It was found that this choice is beneficial in terms of encryption efficiency and that the influence on the quality of the reconstructed images is negligible.
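The difference between pure shuffling and shuffling combined with substitution can be sketched generically as follows. This is not the scheme of Figure 3 or of [26,31]: the sort-based permutation, the byte-wise XOR mask, and all function names are assumptions chosen for illustration. Only the use of a Skew Tent sequence seeded by s0, with p = 0.54321, follows the text.

```python
def skew_tent_sequence(s0, p, count):
    """Iterate the Skew Tent map and return `count` successive states."""
    states, s = [], s0
    for _ in range(count):
        s = s / p if s <= p else (1.0 - s) / (1.0 - p)
        states.append(s)
    return states

def _keys(n, s0, p):
    """Derive a permutation and a byte mask from one chaotic sequence."""
    chaos = skew_tent_sequence(s0, p, 2 * n)
    order = sorted(range(n), key=lambda i: chaos[i])   # chaotic permutation
    mask = [int(c * 256) % 256 for c in chaos[n:]]     # chaotic keystream
    return order, mask

def encrypt(values, s0, p=0.54321):
    """Shuffle (confusion), then XOR-mask (substitution), byte values 0..255."""
    order, mask = _keys(len(values), s0, p)
    return [values[order[i]] ^ mask[i] for i in range(len(values))]

def decrypt(cipher, s0, p=0.54321):
    """Invert encrypt(): remove the mask, then undo the permutation."""
    order, mask = _keys(len(cipher), s0, p)
    plain = [0] * len(cipher)
    for i in range(len(cipher)):
        plain[order[i]] = cipher[i] ^ mask[i]
    return plain

coefficients = [12, 200, 7, 7, 99, 0, 255, 31]   # hypothetical quantized samples
cipher = encrypt(coefficients, s0=0.3)
recovered = decrypt(cipher, s0=0.3)
```

With shuffling alone, the histogram of the ciphertext equals that of the plaintext; the substitution step changes the values themselves, which is the kind of weakness the improved scheme is meant to address.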

Design Flow
The approximation of the discrete cosine transform (DCT) is useful for reducing its computational complexity without a significant impact on its coding performance. Most of the existing algorithms for the approximation of the DCT target only small transform lengths, and some of them are non-orthogonal. In this work, we use an approximation of the 16-point DCT; see [13] for details. Hence, the double matrix multiplication used to perform the DCT is computed without multiplications. Only adders, subtractors, and shifters are required.
The design flow used with the Xilinx tools includes high-level synthesis (HLS) steps performed by Vivado HLS. The goal is the fast prototyping of processing elements in order to develop intellectual property (IP) blocks. With Vivado HLS, we transformed C code into a hardware description language (HDL). For our work, we used VHDL as the target language. The DCT needs a two-dimensional matrix multiplication applied twice, and the second matrix multiplication can only start once the results of the first one are available. Some directives are applied to the C code in order to have high-level control over the implementation. For our work, we only used three directives: INLINE off, PIPELINE, and INTERFACE. The INLINE directive controls whether a function is kept as a single hardware unit or inlined every time it is called; inlining increases potential parallelism at the cost of extra hardware, whereas the off option keeps the function as a separate unit. The PIPELINE directive is useful for the DCT calculation, which uses several loops, and for the matrix multiplication with its two nested loops. Applied to a loop, it causes inputs to be passed to the loop more frequently: a pipelined loop can process new inputs every M clock cycles, where M is the initiation interval (in our case, we left this interval at one). This directive increases the speed of the calculation, but also increases the number of logic elements. Finally, with the INTERFACE directive, we indicated to the HLS tool how to pass parameters between functions and between the processing system (PS) and the programmable logic (PL). We applied this directive with the s_axilite option on the top-level function and on the parameters coming from the PS.
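The effect of the initiation interval can be estimated with the usual first-order model of a pipelined loop: the first result appears after the iteration latency, and each further iteration adds M cycles. The sketch below applies this model; the iteration latency of 4 cycles and the trip count of 256 (one output element of a 16 × 16 product per iteration) are hypothetical values chosen for illustration, not figures reported by the synthesis tool.

```python
def unpipelined_cycles(trip_count, depth):
    """Each iteration runs to completion before the next one starts."""
    return trip_count * depth

def pipelined_cycles(trip_count, depth, ii=1):
    """First-order model: fill the pipeline once, then one result every `ii` cycles."""
    return depth + (trip_count - 1) * ii

TRIP_COUNT = 16 * 16   # assumption: one output element per loop iteration
DEPTH = 4              # assumption: hypothetical iteration latency in cycles

sequential = unpipelined_cycles(TRIP_COUNT, DEPTH)     # 1024 cycles
pipelined = pipelined_cycles(TRIP_COUNT, DEPTH, ii=1)  # 259 cycles
```

Under these assumptions the pipelined loop is roughly four times faster, at the price of the additional registers and logic mentioned above.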

Hardware Description of the Chaotic Generators
In this paper, we propose to use simple chaotic generators, namely the Henon and Skew Tent maps, in order to alleviate the hardware constraints on the edge computing system. The Henon map is a well-known discrete-time dynamical system that exhibits chaotic behavior, and is defined by:

$$x_{n+1} = 1 + y_n - a x_n^2, \qquad y_{n+1} = b x_n,$$

where a and b are the bifurcation parameters of the map, and x0 and y0 are the initial conditions; the traditional practice is to use x0 = y0 = 0. In order to achieve chaotic behavior, a and b are often taken to be equal to 1.4 and 0.3, respectively. As explained before, in this work we used b = 0.3, and a is a constant that depends on the content of the input images.
The Skew Tent map, also called the 1-D asymmetric tent map, is used in many cryptographic applications due to its simplicity, large key space, and high key sensitivity. It is defined by:

$$s_{n+1} = \begin{cases} s_n / p, & 0 \le s_n \le p, \\ (1 - s_n)/(1 - p), & p < s_n \le 1, \end{cases}$$

where s_n ∈ [0, 1] is the state of the system and p ∈ (0, 1) is the control parameter. The initial condition s0 and p can be used as secret keys. In this work, p is fixed to 0.54321. Since the chaotic sequences will be generated on the FPGA, the hardware description of both generators should be provided. The proposed signal flow graphs (SFGs) of the Henon and Skew Tent maps are presented in Figure 4.
As shown in Figure 4, the proposed architecture of the chaotic generators can be obtained with simple logic and arithmetic operators, which are suitable for hardware implementation. Both generators are described in the VHDL language, but could also be described with Vivado HLS.
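As a software reference for the two generators, the following minimal sketch iterates the standard Henon and Skew Tent recurrences in floating point. It is a behavioral model only and says nothing about the fixed-point word lengths of the hardware description; the function names are hypothetical, and a = 1.4 is the classical value rather than the image-dependent constant used in this work.

```python
def henon_step(x, y, a=1.4, b=0.3):
    """One Henon iteration: x' = 1 + y - a*x^2, y' = b*x."""
    return 1.0 + y - a * x * x, b * x

def skew_tent_step(s, p=0.54321):
    """One Skew Tent iteration on the state s in [0, 1]."""
    return s / p if s <= p else (1.0 - s) / (1.0 - p)

def henon_orbit(count, a=1.4, b=0.3, x0=0.0, y0=0.0):
    """Return `count` successive (x, y) states, starting after (x0, y0)."""
    orbit, x, y = [], x0, y0
    for _ in range(count):
        x, y = henon_step(x, y, a, b)
        orbit.append((x, y))
    return orbit

def skew_tent_orbit(count, s0, p=0.54321):
    """Return `count` successive Skew Tent states, starting after s0."""
    orbit, s = [], s0
    for _ in range(count):
        s = skew_tent_step(s, p)
        orbit.append(s)
    return orbit

h = henon_orbit(1000)                 # x0 = y0 = 0, the traditional choice
t = skew_tent_orbit(1000, s0=0.3)     # s0 = 0.3 is an arbitrary example key
```

Each step needs one multiplication by a constant and a squaring (Henon) or one comparison and one scaling (Skew Tent), which is consistent with the small operator count visible in the signal flow graphs.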

Hardware Setup
Our global hardware architecture is composed of two Xilinx ZedBoard evaluation boards, two Zigbee transceiver modules, and two High-Definition Multimedia Interface (HDMI) displays used for debugging and prototype validation. The ZedBoard is based on the Zynq XC7Z020. It offers a large memory (DDR3 and QSPI) and several communication interfaces (USB, Ethernet, peripheral module (Pmod) connectors, HDMI, etc.). The Zynq SoC embeds an ARM Cortex-A9 processor, commonly called the PS (processing system), and a large FPGA fabric called the PL (programmable logic). The communication between the PS and the PL is performed via AXI (advanced extensible interface) buses. An Analog Devices ADV7511 HDMI transmitter provides a digital video interface to the ZedBoard, and the HDMI display is connected to the SoC via the HDMI port. An HDMI display would not be part of a deployed IoMT node; we include it here only to add debug functionality to the designed system. The HDMI display is split into two parts in order to show the input images and the encrypted/decrypted images. Note that input images are stored in RAM before they are processed.
Image transmission is carried out using XBee S6B wireless modules, which provide simple serial-to-IEEE 802.11 connectivity. The XBee wireless module is connected to the ZedBoard through the Pmod E connector. Data are exchanged between the Zynq SoC and the XBee module over a Universal Asynchronous Receiver/Transmitter (UART) link. The Zynq was configured such that the RX (reception) and TX (transmission) pins of the UART are routed to Pmod E, which is used because it is connected directly to the PS part of the Zynq.
The UART frame consists of a start bit, a data byte, and a stop bit. The UART of the ZedBoard can send and receive data at a symbol rate of 921,600 bit/s. Because this rate is lower than that of the XBee wireless module, it is the UART link that limits the maximum throughput. The UART connection is created by configuring MIO 10 and MIO 11 of the PS UART (named UART 0): the RX (input) signal is located on MIO 10 and the TX (output) signal on MIO 11. After placement, routing, and implementation of the hardware design in Vivado, one can see that MIO 10 and MIO 11 are connected to the G7 and B4 ports, respectively.
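Since the UART link is the throughput bottleneck, its payload rate is worth estimating. The sketch below assumes the frame described above (one start bit, eight data bits, one stop bit, no parity) and ignores any XBee protocol overhead, so the figures are upper bounds on throughput. The function names and the $2^{21}$-bit payload are illustrative assumptions.

```python
def uart_payload_rate(baud: int, data_bits: int = 8,
                      start_bits: int = 1, stop_bits: int = 1) -> float:
    """Payload throughput in bit/s for an asynchronous frame without parity."""
    frame_bits = start_bits + data_bits + stop_bits
    return baud * data_bits / frame_bits


def transfer_time_s(payload_bits: int, baud: int) -> float:
    """Lower bound on the time needed to move `payload_bits` over the UART."""
    return payload_bits / uart_payload_rate(baud)


if __name__ == "__main__":
    baud = 921_600
    print(f"payload rate: {uart_payload_rate(baud):.0f} bit/s")
    # 2**21 bits is an illustrative payload size (assumption).
    print(f"transfer time: {transfer_time_s(2**21, baud):.2f} s")
```

With a 10-bit frame, only 8 of every 10 transmitted bits carry image data, so the usable rate is 80% of the symbol rate.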

Hardware Description
The proposed crypto-compression system is implemented with the Xilinx Vivado tool. The target board is the ZedBoard manufactured by Digilent, and the hardware description language is VHDL. The Vivado block design is illustrated in Figure 5. As indicated, nine blocks are used:
(1) The Zynq processing system block, which is the software interface around the Zynq-7000 Processing System.
(2) The HDMI IP for ZedBoard, from the Avnet reference design, which implements the HDMI digital video interface in Xilinx FPGAs.
(3) The reconfigurable compression and encryption IP, which is the proposed IP including the DCT and chaos generators according to the algorithms of Section 2. The DCT compression IP was developed with Vivado HLS, while the chaos generators were described in VHDL.
(4) The reconfigurable decompression and decryption IP, which is the proposed IP including the Inverse Discrete Cosine Transform (IDCT) and chaos generators according to the algorithms of Section 2. The IDCT decompression IP was developed with Vivado HLS, while the chaos generators were described in VHDL.
(5) The AXI GPIO (LEDs, switches, push buttons) for the application: the switches select an index for the secret key, the LEDs display the index, and the push buttons select the emitter/receiver role and the software/hardware solution.
(6) The IIC output of the HDMI IP, used for initialization.
(7) The AXI peripheral block, which provides connections between the PL and the PS.
(8) The processing system reset block, used to asynchronously restart the system.
(9) The clock generator, which provides the 148.5 MHz clock for the HD video display.
The proposed compression/decompression modules have reconfigurable architectures: they compute the 32-, 16-, and 8-point DCT/IDCT using the same computation elements. This is possible thanks to the algorithm we proposed in [12], which was validated in [13].

Hard/Soft Co-Design
The processing chains for transmitting and receiving crypto-compressed images are illustrated at the top of Figure 6. The transceiver protocol of the XBee modules is implemented on the PS side, while the HDMI display is handled on the PL by using the Xilinx IP for HDMI. The rest of the processing blocks can be implemented in software or in hardware. We therefore explored the HW/SW co-design of these blocks by creating four design variants, each implementing a different configuration. The purpose of these configurations is to move processing blocks progressively from the PS to the PL and to quantify the system performance obtained in each case. The configurations are shown in Figure 7a,b for the transmission and reception chains, respectively. For the transmission chain, Config1 is a software-based implementation. Config2 moves the chaos-based encryption and the fusion to the PL side. Config3 additionally moves the filtering to the PL, and Config4 adds the approximate computation of the DCT in hardware. Hence, the block diagram of Figure 5 corresponds to the last configuration. Image decomposition into 16 × 16 pixel blocks was done on the PS since no computation is required.
A similar approach was taken for the reception chain, for which four variants were also considered. Config1 is a software-based solution for decrypting and decompressing multiple images. Config2 decrypts and separates the received images through a hardware implementation of the chaos-based decryption and separation. Config3 adds a zero-padding block on the PL. The approximate IDCT computation is included on the PL side in Config4. Reconstruction of the four received images was done exclusively in software.
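The four partitions described above can be summarized as a small lookup table. The sketch below is only a bookkeeping aid: the block labels and their order in each chain are assumptions derived from the prose, not identifiers from the actual design.

```python
from typing import Dict, FrozenSet, Tuple

Chain = Tuple[str, ...]

# Illustrative block labels (assumption); order follows the prose description.
TX_CHAIN: Chain = ("decomposition", "dct", "filtering", "fusion", "encryption")
RX_CHAIN: Chain = ("decryption", "separation", "zero_padding", "idct", "reconstruction")

# Blocks executed on the PL for each configuration; everything else runs on the PS.
TX_ON_PL: Dict[str, FrozenSet[str]] = {
    "Config1": frozenset(),
    "Config2": frozenset({"fusion", "encryption"}),
    "Config3": frozenset({"fusion", "encryption", "filtering"}),
    "Config4": frozenset({"fusion", "encryption", "filtering", "dct"}),
}
RX_ON_PL: Dict[str, FrozenSet[str]] = {
    "Config1": frozenset(),
    "Config2": frozenset({"decryption", "separation"}),
    "Config3": frozenset({"decryption", "separation", "zero_padding"}),
    "Config4": frozenset({"decryption", "separation", "zero_padding", "idct"}),
}


def partition(chain: Chain, on_pl: Dict[str, FrozenSet[str]], config: str) -> Dict[str, str]:
    """Map every block of a chain to 'PL' or 'PS' for the given configuration."""
    hardware_blocks = on_pl[config]
    return {block: ("PL" if block in hardware_blocks else "PS") for block in chain}


if __name__ == "__main__":
    for config in TX_ON_PL:
        print(config, partition(TX_CHAIN, TX_ON_PL, config))
```

Each configuration is a superset of the previous one, which reflects the incremental migration of blocks from the PS to the PL; decomposition and reconstruction stay in software in every variant.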

Resource Utilization
The four previously illustrated variants were implemented on the ZedBoard. The number of hardware resources used and their ratio to the available resources are given in Table 1. For resource utilization, we considered the number of LUTs (look-up tables) synthesized as logic elements, the number of DRAM, registers, multiplexers, block RAM (BRAM) memory tiles, and DSP48 blocks. Config1 has the lowest resource utilization ratio, since its hardware part includes only the HDMI IP and the AXI buses that interconnect the PS and the PL. In contrast, Config4 has the highest hardware utilization ratio. In all system variants, at most 17% of the ZedBoard resources are used, as illustrated in Figure 8. The remaining resources can therefore be used for other image processing blocks (e.g., detection, recognition, etc.).
The on-chip power consumption profiling is provided in Table 2. The power numbers quoted are estimates from a power calculator, not measurements. These values are used instead of measured power because they are easily obtained: when measuring the power consumed by a given algorithm on an FPGA, both the static and the dynamic power consumed at run time have to be taken into consideration. Moreover, it is established that estimated power consumption is adequate for comparing designs to each other. The estimated dynamic power consumption was slightly higher in Config4 than in Config1. The use of the DCT, IDCT, encryption and decryption with chaos generators, filtering, and zero padding contributes to an increase in power consumption of about 100 mW.
Finally, we should mention that the data path delay obtained with the four configurations is equal to 2.25 ns. This means that as we incrementally added processing blocks to the PL, the data path delay did not increase, since all the processing was pipelined. However, with Config1, since the major part of the processing was done in the PS, the execution time was much higher than that obtained with Config2, Config3, and Config4. In particular, the computation time was measured for the compression and encryption of four grayscale images of 512 × 512 pixels. With Config1 and Config4, the computation time was 8.685 s and 897.28 ms, respectively. Config4 therefore achieved a speedup of about 9.68 over Config1.
According to the ETSI metrics for mobile edge computing in [32], the power efficiency µ can be calculated as a relationship between the consumed power and the traffic.
For four grayscale images of 512 × 512 pixels, the compressed bitstream contains $2^{21}$ bits. The power efficiency obtained with most processing blocks in the PL (Config4) is about 213 nJ/bit, while that obtained with the processing blocks in the PS (Config1) is about $1.64 \times 10^{3}$ nJ/bit. Thus, Config4 achieved a power efficiency about 7.7 times better than that of Config1. Finally, we note that hardware performance may be further enhanced by using predictive computation techniques, as in [33], to predict zero-quantized DCT coefficients, or by fine hardware optimization of the arithmetic operations used in the DCT, as in [34].
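The reported speedup and efficiency gain can be cross-checked with a few lines of arithmetic. The sketch below uses only the timings and nJ/bit figures quoted in this section. It additionally assumes that the bitstream holds $2^{21}$ bits and that the efficiency metric is energy per bit (average power × time / bits), so the back-calculated average power values are indicative only and are not taken from Table 2. All names are hypothetical.

```python
def speedup(t_reference_s: float, t_accelerated_s: float) -> float:
    """Ratio of the reference execution time to the accelerated one."""
    return t_reference_s / t_accelerated_s


def energy_j(energy_per_bit_nj: float, bits: int) -> float:
    """Total energy in joules for `bits` bits at the given nJ/bit figure."""
    return energy_per_bit_nj * 1e-9 * bits


# Figures quoted in the text.
T_CONFIG1_S = 8.685
T_CONFIG4_S = 0.89728
MU_CONFIG1_NJ = 1.64e3
MU_CONFIG4_NJ = 213.0
BITS = 2**21  # assumed size of the compressed bitstream

time_speedup = speedup(T_CONFIG1_S, T_CONFIG4_S)
efficiency_gain = MU_CONFIG1_NJ / MU_CONFIG4_NJ

# Back-calculated average power, assuming efficiency = power * time / bits.
power_config1_w = energy_j(MU_CONFIG1_NJ, BITS) / T_CONFIG1_S
power_config4_w = energy_j(MU_CONFIG4_NJ, BITS) / T_CONFIG4_S

if __name__ == "__main__":
    print(f"speedup: {time_speedup:.2f}")
    print(f"efficiency gain: {efficiency_gain:.2f}")
    print(f"implied power: {power_config1_w:.3f} W (Config1), {power_config4_w:.3f} W (Config4)")
```

Under these assumptions the implied power gap between the two configurations comes out close to 100 mW, which agrees with the increase quoted for the hardware blocks: Config4 draws somewhat more power but finishes much sooner, so the energy spent per bit drops.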

Experimental Setup
The experimental testbed is presented in Figure 9. Two IoMT nodes were considered, each composed of an XBee transceiver connected to a SoC-based ZedBoard by means of a cap card and a Pmod port. An HDMI display would not be used in a real application, but we used it here in order to view the encrypted and decrypted images. Moreover, we used a host computer connected to both nodes through two UART links in order to monitor the status of each node at run time. This is useful since it can indicate the number of received bits: if the number of received bits is lower than expected, the system asks for retransmission.
Each node can be used as a transmitter or a receiver according to the user's preference. Once the XBee modules are configured in the same network and the ZedBoards are powered on and programmed, the user presses the top push button of the first ZedBoard and then the bottom push button of the other ZedBoard. The first board is thereby selected as the transmitter and the second as the receiver. The user then chooses whether the emitter node executes Config1 or Config4 by using the right or left push button, and sets the encryption key with the eight switches. The input images as well as the crypto-compressed image are shown on the HDMI display. The user can now press the middle button of the second board to open the reception, and then the middle button of the first board to send the crypto-compressed images. On the receiver side, the images can be decoded by Config1 or Config4, as in the encoding step. At this stage, the user has to select the correct key with the switches in order to decrypt the images. The test flow is shown in Figure 10.

Results
The FPGA-based implementation results for the IoMT application are presented in Figure 11. The original images and the images decrypted with the correct key are shown on the left, while the original images and the images decrypted with the wrong key are shown on the right side of Figure 11. A demo of this experimental setup can be found in [35].

Performance Analysis
The use of approximate computing decreases the perceptual quality of the reconstructed images. We calculated the average peak signal-to-noise ratio (PSNR) of reconstructed eight-bit grayscale 512 × 512 images. The forward and inverse approximate transforms were applied to obtain the reconstructed images, and the resulting PSNR was compared with that of exact DCT-based compression. The PSNR degradation caused by the proposed approximate DCT was about 0.8 dB and less than 0.6 dB for compression ratios of 50% and 12.5%, respectively. Hence, there is a decrease in accuracy, but the loss remains negligible.
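For reference, the PSNR comparison described above can be sketched as follows. This is a plain-Python illustration of the usual PSNR definition for eight-bit images, not the evaluation code used in this work; images are represented as nested lists of pixel values and all function names are hypothetical.

```python
import math
from typing import Sequence

Image = Sequence[Sequence[int]]


def mean_squared_error(reference: Image, test: Image) -> float:
    """Mean squared error between two images of identical dimensions."""
    if len(reference) != len(test) or any(
        len(r) != len(t) for r, t in zip(reference, test)
    ):
        raise ValueError("images must have identical dimensions")
    total = 0
    count = 0
    for ref_row, test_row in zip(reference, test):
        for a, b in zip(ref_row, test_row):
            total += (a - b) ** 2
            count += 1
    if count == 0:
        raise ValueError("images must not be empty")
    return total / count


def psnr(reference: Image, test: Image, peak: int = 255) -> float:
    """PSNR in dB; infinite when the two images are identical."""
    mse = mean_squared_error(reference, test)
    if mse == 0:
        return math.inf
    return 10.0 * math.log10(peak ** 2 / mse)


def psnr_degradation(original: Image, exact_recon: Image, approx_recon: Image) -> float:
    """PSNR lost (in dB) when the approximate transform replaces the exact one."""
    return psnr(original, exact_recon) - psnr(original, approx_recon)
```

The degradation is the difference between two PSNR values computed against the same original image, which is how figures such as 0.8 dB are to be read.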
Encryption performance evaluation was studied in [36], where MATLAB code was developed for simulation purposes. In this work, plaintext and ciphered images were acquired with the processor and an encryption analysis was carried out, as illustrated in Figure 12. One of the four input images is the Lena image shown in Figure 12a. Its corresponding ciphered image is shown in Figure 12b. The histograms of both images are depicted in Figure 12c (plaintext) and Figure 12d (ciphered image). The ciphered image shows no particular pattern that could reveal the content of the plaintext image. The four ciphered images and their corresponding histogram are illustrated in Figure 12e,f, respectively. The histogram is uniformly distributed, which protects the image from statistical attacks. Finally, an adjacent pixel correlation analysis was carried out. The horizontal adjacent pixel correlation is depicted in Figure 12g (four plaintext images) and Figure 12h (four ciphered images). The left image of Figure 12h shows no correlation between adjacent pixels, which signifies a high security level. As in [36], we calculated the coefficient ρ of the adjacent pixel correlation and found a value of about $5 \times 10^{-3}$, which confirms that adjacent pixels of the ciphered image are essentially uncorrelated.
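The adjacent pixel correlation coefficient can be computed as in the sketch below. It assumes the Pearson correlation taken over all horizontally adjacent pixel pairs; the analysis in [36] may sample pairs differently, and the function name is hypothetical.

```python
import math
from typing import Sequence


def horizontal_adjacent_correlation(image: Sequence[Sequence[int]]) -> float:
    """Pearson correlation between each pixel and its right-hand neighbour."""
    xs = [row[c] for row in image for c in range(len(row) - 1)]
    ys = [row[c + 1] for row in image for c in range(len(row) - 1)]
    n = len(xs)
    if n == 0:
        raise ValueError("image needs at least two columns")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n
    var_x = sum((x - mean_x) ** 2 for x in xs) / n
    var_y = sum((y - mean_y) ** 2 for y in ys) / n
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant image")
    return cov / math.sqrt(var_x * var_y)
```

A natural image gives a coefficient close to 1 because neighbouring pixels are similar, whereas a well-encrypted image should give a value close to 0.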
When the correct key was used for decryption, the image quality was evaluated subjectively. For distances of up to 10 m and 30 m between the emitter and the receiver, the quality of experience was about 90% and 80%, respectively. However, when the emitter and receiver were not in line of sight, the distance should not exceed 3 m for an acceptable quality of experience. Beyond 50 m, transmission is not guaranteed. Table 3 summarizes the obtained results.

Conclusions
In this paper, we proposed an experimental platform for mobile edge computing to be used in the context of the IoMT. The proposed platform could serve as a cornerstone of IoMT applications, including video surveillance, face recognition, pedestrian detection, etc. We have demonstrated that algorithmic and architectural optimizations can relax the constraints related to time and energy consumption. Moreover, we have shown that a HW/SW co-design can improve the power efficiency of the IoMT node. Future work will focus on including more image and video processing applications in the HW part of the node.

Figure 1. The European Telecommunications Standards Institute (ETSI) use case of mobile edge computing applied to multimedia things.

Figure 2. Scheme of the simultaneous compression, fusion and encryption.

Figure 3. Flow chart of the encryption scheme.

Figure 5. Block diagram of the proposed system implemented on ZedBoard.

Figure 6. Hardware setup of the node architecture.

Figure 10. Test flow applied to the experimental setup.

J. Low Power Electron. Appl. 2018, 8, 1

Figure 11. FPGA implementation results. The left figure shows the case of decryption with the correct key. The right figure shows the case of decryption with the wrong key.

Table 1. Resource utilization for several configurations.

Table 3. Evaluation of the transmission quality.