Lightweight Cipher for H.264 Videos in the Internet of Multimedia Things with Encryption Space Ratio Diagnostics

Within an Internet of Multimedia Things, the risk of disclosing streamed video content, such as that arising from video surveillance, is of heightened concern. This leads to the encryption of that content. To reduce the overhead and the lack of flexibility arising from full encryption of the content, a good number of selective-encryption algorithms have been proposed in the last decade. Some of them have limitations: significant delay due to computational cost; excess memory utilization; or, despite being energy efficient, an unsatisfactory level of confidentiality, due to their simplicity. To address such limitations, this paper presents a lightweight selective encryption scheme, in which encoder syntax elements are encrypted with the innovative EXPer (extended permutation with exclusive OR) cipher. The selected syntax elements are taken from the final stage of video encoding, that is, the entropy coding stage. As a diagnostic tool, the Encryption Space Ratio (ESR) measures the encoding complexity of the video relative to the level of encryption, so as to judge the success of the encryption process for a given entropy coder. A detailed comparative analysis of EXPer with other state-of-the-art encryption algorithms confirms that EXPer provides significant confidentiality with a small computational cost and a negligible encryption bitrate overhead. Thus, the results demonstrate that the proposed security scheme is a suitable choice for constrained devices in an Internet of Multimedia Things environment.


Introduction
An Internet of Things (IoT) is a networked architecture [1], of which the Internet of Multimedia Things (IoMT) [2] is an emerging sub-set, integrating many devices and sensors at the Internet edge. In IoMT applications, video-surveillance devices might be deployed in various scenarios, such as within public transport management systems (managing buses, airplanes or road traffic), health management services (for patient or child monitoring), personal asset protection (within homes or construction sites) and many more [3]. The aim is to make these devices intelligent by allowing them to interact with each other, that is, they become smart objects. Storage and later analysis of data [4] can be on remote cloud data centers. However, even more so than within the traditional Internet, the IoT architecture [5] has inherent security weaknesses of which this paper focuses on Q1. Are those encryption schemes computationally efficient enough (in terms of execution/encoding time) to employ in an IoMT communication environment? Q2. Is the analyzed ESR effective enough to apply SE to visually secure the videos encoded with one or other of the two common entropy coders in common codec use, that is, context adaptive variable length coding (CAVLC) and context adaptive binary arithmetic coding (CABAC) (see Section 4.1)?
The focus of this paper is to answer those questions by experimentation. In the case of full (sometimes known as naïve) encryption, the complete video is encrypted. Therefore, the encryption overhead and the space ratio are at a maximum, which causes a bitrate overhead too. Herein, such weaknesses are addressed by proposing a complete lightweight security scheme for IoMT applications on a standardized H.264/advanced video coding (AVC) encoder for constrained surveillance devices. Notice that HEVC is currently too resource intensive to be used in an IoMT environment, see [14]. In the proposed scheme, SE is applied through the proposed encryption algorithm, EXPer, after an effective level of visual protection has been identified by means of ESR. In other words, ESR indicates whether there is sufficient visual distortion within the video frames to preserve the confidentiality of the content. The contributions of the paper are further summarized in Section 1.1.

Context
SE already has a potential role in consumer electronics applications [15] and also can support interoperability [16], when multiple encryptions of the same video stream are transported. Alternatively, region-of-interest (ROI) encryption of some parts of a video frame such as the face or people within a frame [17,18], may reduce the encryption overhead. However, ROI encryption is application specific, while SE potentially offers a more general solution. Compared to full (or naïve) encryption, both SE and ROI encryption can reduce computational and bitrate overhead [19]. SE may be carried out on the most significant information (as regards distortion) at a choice of different stages of the codec, such as on the original pixels, the transform coefficients, the quantization indexes, the bit-planes, the entropy coder, or the final output bitstream [20]. However, some forms of encryption alter the video statistics, resulting in encryption bitrate overhead and lack of format compliance at the decoder. Applying encryption at the entropy coding stage minimizes those problems [17,18], which is why encryption at that stage of compression is chosen for this paper.
Entropy coders are a feature of standardized hybrid video encoders [21]. H.264/AVC [22] and its scalable video coding (SVC) extension [23][24][25] employ the same entropy coding modes: variable length coding (VLC) or binary arithmetic coding (BAC). Both of these modes operate in a context-adaptive (CA) manner, leading to the names CAVLC [26] and CABAC [27] entropy coders. Within H.264/AVC either CABAC or CAVLC entropy coders can be selected, as the two coders trade-off computational complexity against compression efficiency. The HEVC [28] CABAC coder is a slightly modified version of the H.264/AVC CABAC encoder and, thus, entropy-integrated SE can be configured [13] to work with either codec. However, using HEVC for an IoT is questionable owing to its high computational complexity, except possibly when encoding takes place at a cluster head with maximum energy [29].
The H.264/AVC codec is selected for implementation of lightweight encryption because, in both the CCTV industry and for smart monitoring in an IoT, surveillance devices (cameras) mostly operate on microprocessors, especially the Raspberry Pi (RasPi), which only supports video compression in the H.264/AVC format [30]. In an IoT, the RasPi is an economical and preferred platform because it offers a complete Linux server on a tiny platform. To the best of the authors' knowledge, SE utilized with ESR diagnostics is a novel contribution. In general, the contributions of this paper are:
1. Design of a joint crypto-compression scheme acting upon selected video syntax elements output by the entropy engine of an H.264/AVC encoder. SE is applied, keeping in mind the requirements of IoMT devices.
2. Ensuring that those selected syntax elements do not, once encrypted, have the potential to crash a decoder at the receiver. In other words, ensuring that they are format compatible with the H.264/AVC standard, even when encrypted.
3. Application of ESR to the output from two entropy coders, CAVLC and CABAC. ESR is used as a diagnostic tool to obtain efficient SE. ESR was applied to ten test or benchmark video sequences. (The same procedure has been applied to video clips captured by constrained RasPi cameras.)
4. Finding that the ESR estimate for CABAC is lower than that for CAVLC, which implies that the CABAC entropy coder is suitable for IoMT cameras.
5. Introducing an innovative, single-round, lightweight cipher, EXPer (with five sequential steps of block-level eXclusive OR (XOR) cipher and bit-level permutation).
6. Performing a series of experiments with EXPer upon the output of an H.264/AVC encoder, comparing that cipher to simple XOR and the industry-standard advanced encryption standard (AES) [11]. Therefore, three different ciphers are extensively tested (from different aspects) on multiple videos with varying color and motion characteristics. The perceptual strength of the ciphers is compared through different video quality metrics and their computational efficiency is evaluated in terms of execution time on tested videos.
7. Performing cryptanalysis showing that EXPer is secure against a variety of attacks, including key-guessing, perceptual, and known-plaintext attacks, as well as statistical and inference attacks.
The remainder of this paper is organized as follows. In Section 2, prior efficient lightweight schemes proposed by researchers of IoT are discussed. Section 3 presents the proposed lightweight cipher scheme EXPer, the adopted SE methodology, and diagnostics through ESR. Section 4 describes the promising results over tested videos. Section 5 is a comparative analysis of EXPer with other ciphers (see point 6 of the contributions). Finally, Section 6 rounds off by considering the implications for those planning IoMT video applications with a concern for confidential video content.

Related Studies
The heterogeneity of the communication technologies across an IoMT, deriving from the assortment of devices, sensors, and protocols, is a cause of security concern. Messages that are transmitted from smart objects will usually be stored and forwarded through several nodes (such as through video sensors, video relay, transcoders, or other intermediate devices) to reach their endpoint [7]. An endpoint might be a message sink or base station (BS), which passes data to the next layer of the architecture, as occurs in traditional sensor networks when a sink communicates over a satellite link. Figure 1 shows end-to-end communication of multimedia sensor networks within an IoT. Again, there is a need to preserve confidentiality over that link but there is also a need to optimize energy consumption and reduce latency.
To ensure confidential communication of information in an IoT, many schemes have been proposed that employ existing state-of-the-art encryption algorithms [31][32][33][34][35][36][37]. Table 1 is a summary of some proposed encryption schemes for IoT with standard cipher algorithms.
In addition to encryption, researchers have also proposed authentication schemes for IoT with existing standardized algorithms. Lee et al. [38] proposed a lightweight authentication protocol for RFID systems. In that protocol, privacy protection and anti-counterfeiting were achieved by an encryption algorithm based on XOR manipulation. Mahalle et al. [39] proposed identity authentication and capability-based access control (IACAC). That scheme provided both authentication and access control for an IoT. However, the scheme results in extra overhead, due to its key-management procedures. The authors of [40] employed encryption and hash algorithms in their proposed solution to achieve confidentiality and message integrity in an IoT. All the same, their solution fails to deal with large amounts of multimedia data because the proposed encryption algorithm can encrypt only 64 bits per block; hence, it suffers from slow operation.
Although the encryption schemes given in Table 1 provide high confidentiality, the authors did not consider multimedia content when evaluating the performance of their schemes. Moreover, traditional encryption algorithms, such as AES and triple data encryption standard (DES) encryption, as used in the proposed schemes, are inefficient for an IoMT because of their computationally intensive nature. Hence, those schemes appear to be unsuited to the requirements of real-time IoMT applications, due to their relatively high bitrate overhead, computational overhead, and bandwidth utilization.
Consequently, lightweight encryption algorithms are required to alleviate those overheads for low-cost, low-power devices. Recently, there has been much interest shown by researchers and standardization bodies in designing lightweight algorithms for secure end-to-end communication in an IoT. Such algorithms are generally based on three principles: (1) substitution, (2) XOR, and (3) permutation. Thus, the algorithms newly proposed by other researchers are also based on these principles [41][42][43][44][45][46][47][48]. Recently, the authors of [48] proposed a one-round cipher (implemented on static images) for IoMT in which the substitution and permutation principles were selected for the encryption. However, substitution is considered resource expensive and should be used with caution over videos, especially for the resource-limited devices of an IoMT. Therefore, to avoid the computational overhead common to substitution, shuffling or permutation is employed in lightweight image encryption algorithms [41,43]. In this current study, a lightweight cipher is designed that additionally employs the XOR principle. It consists of a single round of five sequential steps (three XOR steps and two permutations). Table 2 is an overview of some recently proposed lightweight encryption algorithms in comparison with our proposed encryption algorithm for IoMT communication.
Likewise, to avoid extra computational overhead and bitrate control, chaos theory is also utilized to implement the encryption process for IoMT systems [49,50]. Chaos theory has proved attractive because of its simplicity and statistical qualities leading to randomized output. Generally, chaotic algorithms are based on a chaotic map and S-box substitution, with multiple rounds to create randomization. However, in the substitution process, multiple rounds to create the desired random output increase the execution time. In fact, in [51] doubt has been cast upon the computational gain from employing chaotic encryption, compared to traditional block-based encryption, such as through AES. Indeed, statistical tests often used to verify the confidentiality of chaotic encryption fail to highlight known insecure encryption algorithms, casting doubt on the claimed security properties. In general, researchers have proposed ciphers for general-purpose applications and do not consider the specific ESR required to provide effective visual protection, especially in images and videos.
2. EXPer, an innovative lightweight cipher based on a combination of XOR and bit-level permutation rounds, with three different 128-bit keys.
3. ESR is calculated according to the selected syntax elements of CAVLC and CABAC, as a way of diagnosing the effective visual protection (Section 4).
4. SE is applied by utilizing EXPer, according to the guidance given by ESR (Section 4).

Syntax Elements Selection of Entropy Coders
There are two forms of entropy engine for efficient compression in an H.264/AVC video encoder, CAVLC [21] and CABAC [22]. Both CAVLC and CABAC are lossless forms of coding (applied after the earlier lossy encoding stages of the hybrid codec) in which there is a tight data dependency between elements in the output bitstream. CAVLC employs the concatenation of Unary and Exp-Golomb coding for a number of parameters, such as macroblock (MB) type (i.e., the prediction method, inter or intra), coded block pattern (CBP) (which records which blocks within an MB contain non-zero (NZ) transform coefficient (TC) residuals), delta QP, reference frame index, and motion vector differences (MVDs). Quantized transform block coefficients (residuals) are VLC coded after extraction (normally through zigzag scanning) from a block. CABAC gains in efficiency over CAVLC as syntax elements are first converted to a binary format, which allows binary arithmetic coding to be utilized. Arithmetic coding permits sub-integer (fractional) bit allocation per symbol (unlike CAVLC) but is computationally expensive.
In this paper, for convenience of implementation, the focus is not on a newly proposed SE encoder technique. Therefore, we have employed our previous SE schemes, for example as reported in [52], with one newly identified parameter for enhanced visual protection of videos. For the convenience of new researchers, we give simple names to the selected parameters: (1) motion, dealing with the movement of objects in videos, including camera zooming and other viewing adjustments; (2) texture, the data carrying pixel information; and, as a new parameter, (3) the difference of quantization parameters (delta QP, or dQP). Their details can be found in [26,27].
It is also worth mentioning that these three types of parameters are produced only from residual information, as presented to the final entropy coding stage of a hybrid encoder, and have been shown to be format compliant in experiments. Textural residuals are taken from both homogeneous and heterogeneous areas within a video. For motion encryption, the arithmetic signs of the motion vector differences (MVDs) are encrypted; the textural syntax elements differ between CAVLC and CABAC (as given in the top grey box of Figure 2); and the absolute values of dQP are selected for encryption. The selected syntax elements of CABAC will be referred to as bins in the subsequent sections of this paper. A block diagram of the proposed security scheme is given in Figure 2.
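Why sign encryption of MVDs can remain format compliant may be illustrated with the signed Exp-Golomb code, which H.264/AVC uses for MVDs when CABAC is not selected. The following Python sketch is an illustration only and is not the authors' implementation: the function names ue, se and encrypt_mvd_sign are hypothetical, and the single key bit stands in for whatever key material the cipher supplies. It suggests that flipping the sign of a non-zero MVD leaves the codeword length unchanged, so that no bitrate overhead arises from this particular operation.

```python
def ue(code_num: int) -> str:
    """Unsigned order-0 Exp-Golomb codeword for a non-negative integer."""
    if code_num < 0:
        raise ValueError("code_num must be non-negative")
    binary = bin(code_num + 1)[2:]          # binary form of code_num + 1
    return "0" * (len(binary) - 1) + binary  # prefix of leading zeros


def se(value: int) -> str:
    """Signed Exp-Golomb codeword: v > 0 maps to 2v - 1, otherwise to -2v."""
    code_num = 2 * value - 1 if value > 0 else -2 * value
    return ue(code_num)


def encrypt_mvd_sign(mvd: int, key_bit: int) -> int:
    """Flip the sign of an MVD when the (hypothetical) key bit is 1."""
    return -mvd if key_bit else mvd


if __name__ == "__main__":
    # Illustrative MVD values only; they are not taken from any test video.
    for mvd in (1, 3, -7):
        enc = encrypt_mvd_sign(mvd, 1)
        print(mvd, se(mvd), "->", enc, se(enc))
```

For a non-zero MVD, the codewords before and after the sign flip differ only in their final bit, which is why the surrounding bitstream can still be parsed.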

Lightweight Cipher
The proposed lightweight cipher, EXPer, provides both diffusion and confusion primitives, through XORing and permutation. Because the normal substitution process (in mainstream ciphers such as AES) is computationally intensive, it is not included in the proposed cipher. Moreover, the different forms of XOR are not computed over videos, because they are effective for computation over a single image rather than over a whole video with a large quantity of surveillance frames. Instead, permutation is applied at the bit level within each selected byte to effectively secure the camera-captured videos.
EXPer encryption consists of five steps/stages with a single iteration over those steps. Across these steps, XOR is performed using a secret key (starting with k_1), and bit-wise permutation is performed by a shift operation. Permutation is performed with two randomly selected offsets, v_1 and v_2, ranging in value from 1 to 8. Additionally, permutation is performed on the output from the previous stage to provide significant statistical randomness with a reduced computational complexity. The permutation is applied at the bit level, by re-ordering the bits within each byte. The impact of this permutation is not easily compensated for by an attacker, given that large volumes of video data are involved. The symmetric secret keys, namely the secret key (k_1), sub_key1 (k_2), and sub_key2 (k_3), are dynamically generated at run-time for each input bitstream, by using a pseudo-random function (PRF). To keep the procedure simple, three dynamic keys are generated per video sequence and stored in a registry. Key security can be enhanced by using any standardized key management scheme [53]. Additionally, not one but three 128-bit secret keys are utilized in the XOR operations. Notice that a key space greater than 2^100 is considered resilient to key guessing or brute force attacks over keys [54]. Furthermore, selective encryption of selected syntax elements within large volumes of video data has proven to be strong [55] (but see also the updated cryptanalysis of Section 5.5). Moreover, the selected offset values will permute the bits within each byte. We consider the proposed algorithm to be a stream cipher because of the resulting byte-level encryption, in addition to the bit-level XOR operations.
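The per-sequence generation of keys and offsets described above might be sketched as follows. This is a minimal illustration under stated assumptions: the text does not specify the PRF, so Python's standard secrets module is substituted for it here, and the names generate_exper_parameters and registry, as well as the sequence label, are hypothetical.

```python
import secrets

KEY_BYTES = 16  # 128-bit keys, as stated in the text


def generate_exper_parameters():
    """Return ((k_1, k_2, k_3), (v_1, v_2)) for one video sequence.

    Assumption: secrets.token_bytes stands in for the unspecified PRF.
    """
    keys = tuple(secrets.token_bytes(KEY_BYTES) for _ in range(3))
    # Offsets lie in 1..8 as in the text; note that an offset of 8
    # leaves an 8-bit byte unchanged under a circular shift.
    offsets = tuple(secrets.randbelow(8) + 1 for _ in range(2))
    return keys, offsets


# Hypothetical in-memory "registry": one parameter set per video sequence.
registry = {}
registry["example_sequence"] = generate_exper_parameters()
```

A real deployment would protect the registry with a key management scheme, as the text notes; the dictionary above only shows the shape of the stored data.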

Working of EXPer
The five steps of the proposed algorithm are discussed in more detail below.
Step 1: In the first step, the input bitstream is encrypted by performing an XOR operation with a 128-bit secret key. Let X be the selected syntax elements (or bins) of the CABAC entropy coder. Then, X is XORed with the secret key, k_1. In other words, each 128-bit block of the stream of selected syntax elements, extracted from the original compressed bitstream, is encrypted and then in encrypted form is placed back into the output bitstream:
X ⊕ k_1 = X' (1)
where ⊕ is the XOR operator, X contains the syntax elements (or bins) selected from any input bitstream, and X' contains the resulting encrypted syntax elements, grouped in blocks of 128 bits.
Step 2: In the second step, a permutation is applied to the encrypted output X' of (1). Thus, the selected and encrypted syntax elements are re-ordered through the permutation. Specifically, the elements of X' are cyclically permuted, byte by byte, by offset value v_1 using a circular right-shift operator. The transformation of the input bytes into the encrypted and permuted output is shown in (2) for an offset value of one, and in general in (3) for an offset value of v_1:
X' >>> 1 → X" (2)
X' >>> v_1 → X" (3)
where → is a transformation symbol and >>> is a circular shift operator that signifies transforming X' into the X" bitstream, through a circular right-shift of the bits of each byte of X'. v_1 denotes an offset value.
Step 3: In the next step, the resulting output of the previous step, which is X", is again transformed by an XOR operation with the 128-bit sub-key k_2, as:
X" ⊕ k_2 = X"' (4)
As already mentioned in Section 3.2, k_2 is derived by means of a PRF.
Step 4: In the fourth step, the previously encrypted output X"' is permuted once again with offset value v_2, again by a circular right-shift operator, applied on a per-byte basis:
X"' >>> v_2 → X"" (5)
Step 5: In the final step, the resulting bitstream, X"", is XORed with the 128-bit sub_key2 (k_3), to produce the encrypted bitstream E_output:
X"" ⊕ k_3 = E_output (6)
Subsequently, the bits of the encrypted bitstream E_output are re-merged with the compressed video bitstream. In that way, a decoder receives a format compatible bitstream, according to the format of the H.264/AVC standard.
The proposed algorithm is simple and, thus, convenient to implement even on videos taken directly from RasPi cameras. The pseudo-code and flowchart in Figure 3 demonstrate the simplicity of a software implementation with encryption and decryption rounds.
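The five steps above, together with their inverse, can be sketched in a few lines of Python. This is a minimal sketch of the steps as described in the text, not the authors' C/C++ implementation within JSVM; the function names are hypothetical, the selected syntax elements are assumed to be available as a byte string, and a final partial block is assumed to be XORed with the leading bytes of the key.

```python
def xor_blocks(data: bytes, key: bytes) -> bytes:
    """XOR data with the key, repeated over successive 128-bit blocks."""
    return bytes(b ^ key[i % len(key)] for i, b in enumerate(data))


def rotr_bytes(data: bytes, v: int) -> bytes:
    """Circular right-shift of the bits of every byte by v positions."""
    v %= 8
    return bytes(((b >> v) | (b << (8 - v))) & 0xFF for b in data)


def rotl_bytes(data: bytes, v: int) -> bytes:
    """Circular left-shift of the bits of every byte (inverse of rotr_bytes)."""
    v %= 8
    return bytes(((b << v) | (b >> (8 - v))) & 0xFF for b in data)


def exper_encrypt(x, k1, k2, k3, v1, v2):
    """Single round of five steps: XOR, shift, XOR, shift, XOR."""
    x1 = xor_blocks(x, k1)      # Step 1: X  xor k_1
    x2 = rotr_bytes(x1, v1)     # Step 2: per-byte circular right-shift by v_1
    x3 = xor_blocks(x2, k2)     # Step 3: xor k_2
    x4 = rotr_bytes(x3, v2)     # Step 4: per-byte circular right-shift by v_2
    return xor_blocks(x4, k3)   # Step 5: xor k_3 gives E_output


def exper_decrypt(e, k1, k2, k3, v1, v2):
    """Undo the five steps in reverse order."""
    x4 = xor_blocks(e, k3)
    x3 = rotl_bytes(x4, v2)
    x2 = xor_blocks(x3, k2)
    x1 = rotl_bytes(x2, v1)
    return xor_blocks(x1, k1)


if __name__ == "__main__":
    # Fixed demonstration keys and offsets only; real keys would be random.
    k1, k2, k3 = bytes(range(16)), bytes(range(16, 32)), bytes(range(32, 48))
    bins = b"selected syntax elements (bins)"
    cipher = exper_encrypt(bins, k1, k2, k3, 3, 5)
    assert exper_decrypt(cipher, k1, k2, k3, 3, 5) == bins
    print(cipher.hex())
```

Because every step maps a byte to a byte, the ciphertext has the same length as the selected data, which is consistent with the claim of negligible bitrate overhead. The sketch does not itself guarantee format compliance, which depends on which syntax elements are selected.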

Experimental Results and Discussion
In order to evaluate the performance, experiments were performed on ten well-known [56] test videos with varying characteristics, such as slow/fast motion and light/dense colors. The tested videos were configured in common intermediate format (CIF) (352 × 288 pixels/frame) at 30 fps, with 4:2:0 chroma sampling, an IBBP group of pictures (GoP) frame structure and an intra-refresh period of length 16, and were encoded with H.264/AVC. The videos are evaluated at different QP values. (In H.264/AVC, the QP ranges from 0 to 51, with higher QPs corresponding to higher compression.) All experiments were performed on a 64-bit operating system with a 2.30 GHz Core i5-6200U processor and 8 GB RAM. The algorithm was developed using the C/C++ programming language by modifying the JSVM reference software with a single layer [57].

Calculation of ESR for Entropy Coders
Before applying SE with EXPer to the test videos, this paper analyzes the ESR of the two entropy coders over which SE is applied. ESR is the amount of data within each video (expressed as a percentage) over which SE produces acceptable visual protection (see also Section 1's introduction to ESR). The ESR percentage is directly proportional to the computational cost of applying SE over videos; that is, the higher the ESR, the higher the computational cost of SE, and vice versa. The ESR for the videos is calculated on the basis of the selected syntax elements (as specified in Section 3.2). The tested videos are listed in Table 3 and configured as described at the beginning of this section.
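As a rough illustration of the definition above, ESR can be computed as the share of the compressed bitstream occupied by the selected syntax elements. The sketch below is an assumption about how such a ratio might be tallied, not the authors' code: the function name esr_percent and the bit counts in the usage example are hypothetical and do not come from Table 3.

```python
def esr_percent(selected_bits: int, total_bits: int) -> float:
    """Percentage of the bitstream occupied by the selected syntax elements."""
    if total_bits <= 0:
        raise ValueError("total_bits must be positive")
    if not 0 <= selected_bits <= total_bits:
        raise ValueError("selected_bits must lie between 0 and total_bits")
    return 100.0 * selected_bits / total_bits


if __name__ == "__main__":
    # Hypothetical bit counts per class of syntax element for one sequence.
    total_bits = 1_000_000
    selected = {"motion": 4_000, "texture": 60_000, "delta_qp": 400}
    for name, bits in selected.items():
        print(name, esr_percent(bits, total_bits))
    print("combined", esr_percent(sum(selected.values()), total_bits))
```

Under this definition, full (naïve) encryption corresponds to an ESR of 100%, so the ESR also indicates how much data is left unencrypted by SE.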
Taking the ESR percentages for CAVLC first in Table 3, it is apparent that there is considerable content dependency, probably linked to the spatial complexity of individual video frames and to the temporal complexity due to the level of motion activity within sequences. The ESR percentage of motion elements is much lower than the texture ESR, which arises from spatial complexity. However, from the observations of Section 4.1, the ESR value calculated for motion parameter SE elements alone is insufficient in itself to guarantee encryption confidentiality. In contrast, the ESR value for motion and texture SE elements combined, when those syntax elements are derived from CAVLC, can be considerable, with a maximum ESR of 29.27% for the Flower test video. It is also the case that, despite the view that SE syntax elements/parameters can be chosen so that in a statistical sense there is little impact on the bitrate overhead, experimental evidence shows the encryption ratio to be considerable. Compared with CAVLC, the ESR for SE of CABAC syntax elements in Table 3 is considerably lower, with an average (arithmetic mean) of 12% for the CAVLC and 7.5% for the CABAC encoded test videos. Additionally, the maximum encryption ratio drops to 13.22% for CABAC, and occurs for the Mobile rather than the Flower video. Given that CABAC already has an advantage in terms of compression efficiency (refer back to Section 3), CAVLC must be adopted with caution for IoT applications. This is unfortunate given the reduced computational overhead arising from CAVLC and its inclusion in H.264/AVC's baseline-type profiles. The comparison of the ESR calculated for CAVLC and CABAC is illustrated in Figure 4. It is also worth noticing in Table 3 and Figure 4 that, for all tested samples, CAVLC produces more texture information than CABAC. This property results in a higher texture ESR for encryption and, hence, more computational overhead than with CABAC.
For this reason, the EXPer experiments in the next section were performed with ESR on CABAC.

Performance of EXPer
For the evaluation of EXPer, experiments were performed on several test video sequences with the selected parameters, namely motion, texture, and delta QP, together with their combinations. SE is applied with CABAC on all tested videos because CABAC is more compression efficient (refer back to Section 3) and produces a lower ESR than CAVLC. Another reason for choosing CABAC is that encryption of the by-pass syntax elements in CABAC does not affect the context models, and encrypting at the entropy coder stage does not affect bitstream compliance at the decoder, which is why by-pass CABAC syntax elements are more appropriate for SE in an IoMT. Table 4 shows the calculated ESR for CABAC syntax elements for the ten videos. The visual results with the proposed encryption algorithm, EXPer, on the CIF video sequences Mobile, ICE and Stefan are presented in Figure 5, which imply that sufficient confidentiality is achieved without generating encryption overhead. The ESR of delta QP is only 0.04%, while it is 0.34% and 6.92% with only motion and only texture, respectively, for the ICE video. The ESR for all parameters for the ICE video is 4.25% with the CABAC coder, which is 95.75% less data than for naïve encryption. The average ESR for the test videos with all parameters combined (i.e., motion, texture and delta QP) is 8.69%, which is minimal and can be adopted by IoT devices.
Furthermore, as previously mentioned, it is also worth noticing that the ESR with delta QP is comparatively lower than the ESR of only motion or only texture parameters for all tested videos. Thus, the ESR of delta QP combined with motion (0.50% for the Mobile video and 0.14% for the ICE video) or delta QP combined with texture (12.80% for the Mobile video and 4.15% for the ICE video) is also lower than the encryption ratio with combined motion and texture (13.22% for the Mobile video and 4.21% for the ICE video). The important point to note here is that SE on the absolute values of dQP is not possible with complex cipher algorithms, because the number of rounds in complex ciphers pushes these values out of range, which destroys the format compliance and compression efficiency of the bitstream and, consequently, crashes the decoder [52].
However, in this paper, the visual results in Figure 5a4,b4,c4 show the effectiveness of the EXPer algorithm, as SE on dQP syntax elements is implemented in such a way that their absolute values do not go out of range and, as a result, the compression efficiency and format compliance of the videos are both maintained. This format compliance cannot be achieved through the AES algorithm. Overall, the results in Figure 5 indicate that EXPer encryption provides sufficient visual protection with a lower computational and encryption cost than conventional AES encryption.
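The pairwise ESR figures quoted earlier in this section can be cross-checked with simple arithmetic. The sketch below rests on an assumption that is not stated in the text, namely that the three classes of syntax element are disjoint, so that their ESR contributions add; the function name combined_from_pairs is hypothetical. Under that assumption, the ESR with all three parameters is half the sum of the three pairwise figures, and the delta QP share is that total minus the combined motion and texture figure.

```python
def combined_from_pairs(dqp_motion: float, dqp_texture: float,
                        motion_texture: float) -> dict:
    """Recover the all-parameter ESR and the delta QP share from pairwise ESRs.

    Assumes additive, non-overlapping contributions (an assumption of this
    sketch, not a statement from the paper).
    """
    total = (dqp_motion + dqp_texture + motion_texture) / 2.0
    return {
        "all_parameters": total,
        "delta_qp": total - motion_texture,
        "saving_vs_naive": 100.0 - total,  # data left unencrypted, in percent
    }


if __name__ == "__main__":
    # Pairwise percentages as quoted in the text for the two sequences.
    print("ICE", combined_from_pairs(0.14, 4.15, 4.21))
    print("Mobile", combined_from_pairs(0.50, 12.80, 13.22))
```

For the ICE figures, this arithmetic gives approximately 4.25% for all parameters, 0.04% for delta QP and 95.75% of the data left unencrypted, which agrees with the corresponding values quoted in the text.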

Computational Cost Analysis
To analyze the performance of the EXPer, the absolute encryption time in seconds was measured for CIF videos. Table 5 shows that the time is negligible compared to the compression time. Thus, the results show that EXPer encrypts the videos with a low computational cost, which is on average 3.1% of the H.264/AVC compression time (when encoded with CABAC) without encryption. Notice that the absolute encryption time is taken separately to the Encoding time (

Security Analysis of EXPer
The results obtained from various experiments in this Section validate the robustness of EXPer.

Perceptual Security
Perceptual quality is considered an important check on the strength of encryption algorithms. An encryption algorithm is considered robust if it distorts a video sequence in such a way that an observer cannot visually detect any useful information in the encrypted bitstream. Clearly, the term 'useful' depends on the purpose to which the video is put, which, herein, is assumed to be IoMT purposes. The visual results of Figure 5 already show that the video sequences encrypted with EXPer are distorted compared to the original video sequence. Furthermore, to evaluate the structural distortion produced by the proposed algorithm, 3 × 3 Laplacian edge detection [58] was performed. The detected edges of the plaintext and encrypted video frames are illustrated in Figure 6. Comparing Figure 6a2, b2 with Figure 6a3, b3 shows that SE with EXPer distorts the video in such a way that an attacker cannot easily acquire similarity information from the edges of the encrypted video.
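As an illustration of the edge-detection check, the following is a minimal sketch of a 3 × 3 Laplacian filter applied to a grayscale frame stored as a list of rows. The 4-neighbour kernel and the function name `laplacian_edges` are assumptions made for this example; the paper does not state which Laplacian variant or implementation was used.

```python
# 3 x 3 Laplacian kernel (4-neighbour form). This variant is an assumption;
# an 8-neighbour kernel is an equally common choice.
LAPLACIAN = [[0, 1, 0],
             [1, -4, 1],
             [0, 1, 0]]


def laplacian_edges(frame):
    """Return the absolute Laplacian response of a 2-D grayscale frame.

    Border pixels are left at 0 because the kernel does not fit there.
    """
    h, w = len(frame), len(frame[0])
    out = [[0] * w for _ in range(h)]
    for i in range(1, h - 1):
        for j in range(1, w - 1):
            acc = 0
            for di in (-1, 0, 1):
                for dj in (-1, 0, 1):
                    acc += LAPLACIAN[di + 1][dj + 1] * frame[i + di][j + dj]
            out[i][j] = abs(acc)
    return out


# A flat frame has no edges; a vertical step produces a response at the step.
flat = [[10] * 5 for _ in range(5)]
step = [[0, 0, 255, 255, 255] for _ in range(5)]
flat_edges = laplacian_edges(flat)
step_edges = laplacian_edges(step)
```

Applying the same filter to a plaintext frame and to its encrypted counterpart gives two edge maps that can be compared visually, in the manner of Figure 6.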

Peak signal to noise ratio (PSNR)
PSNR [59] expresses, in decibels, the ratio between the peak signal power and the mean squared error (MSE) between an original frame and the corresponding frame reconstructed from the encrypted bitstream. It is calculated as:

$$\mathrm{PSNR} = 10 \log_{10}\left(\frac{\mathrm{MAX}^2}{\mathrm{MSE}}\right), \qquad \mathrm{MSE} = \frac{1}{mn}\sum_{i=1}^{m}\sum_{j=1}^{n}\left(X(i,j) - Y(i,j)\right)^2$$

where m and n are the width and height of the video frame under consideration, X and Y represent the pixel intensity values of the two frames being compared, and MAX is the maximum possible pixel intensity (255 for 8-bit samples). For a video sequence, the PSNRs of the frames are averaged across the sequence. A lower value of PSNR indicates less similarity between an original video sequence and the video sequence reconstructed from a compressed and encrypted bitstream. Table 6 (columns 2, 3) reports the average (arithmetic mean) PSNR of the test video sequences after SE (with EXPer) and without encryption (video only compressed). The results show that the average PSNR after encryption is much lower than without encryption. Hence, the proposed SE with EXPer produces highly distorted video. Thus, EXPer can be considered for video protection in IoMT.
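The PSNR computation can be sketched in a few lines. This is a minimal illustration that assumes 8-bit grayscale frames stored as lists of rows; the function names `psnr` and `sequence_psnr` are chosen for this example and are not taken from the paper.

```python
import math


def psnr(x, y, peak=255):
    """PSNR in dB between two equally sized 2-D frames (8-bit peak assumed)."""
    rows, cols = len(x), len(x[0])
    mse = sum((a - b) ** 2
              for row_x, row_y in zip(x, y)
              for a, b in zip(row_x, row_y)) / (rows * cols)
    if mse == 0:
        return math.inf  # identical frames: PSNR is unbounded
    return 10 * math.log10(peak ** 2 / mse)


def sequence_psnr(frames_x, frames_y, peak=255):
    """Arithmetic mean of the per-frame PSNR values of a sequence."""
    values = [psnr(x, y, peak) for x, y in zip(frames_x, frames_y)]
    return sum(values) / len(values)


black = [[0, 0], [0, 0]]
white = [[255, 255], [255, 255]]
worst_case = psnr(black, white)   # MSE equals peak squared
identical = psnr(black, black)
```

The two toy frames mark the extremes: maximally different 8-bit frames give the lowest possible PSNR, while identical frames give an unbounded value.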

Pixel-correlation analysis
Another statistical method to measure the similarity between the original and encrypted pixels of a video frame is cross-correlation. The cross-correlation coefficient, r, is calculated as:

$$r = \frac{\sum_{i}\sum_{j}\left(X(i,j) - \bar{X}\right)\left(Y(i,j) - \bar{Y}\right)}{\sqrt{\sum_{i}\sum_{j}\left(X(i,j) - \bar{X}\right)^2 \, \sum_{i}\sum_{j}\left(Y(i,j) - \bar{Y}\right)^2}}$$

where $\bar{X}$ and $\bar{Y}$ are the mean intensity values of the pixels of the original and distorted video frames. The value of r ranges from −1 to 1. When two frames are the same, the correlation coefficient is at its maximum, which is 1. Therefore, a correlation coefficient closer to zero indicates higher distortion as a result of encryption. Table 6 (columns 4 and 5) presents the calculated cross-correlations between encrypted and compressed video frames. The average value of the pixel correlations between the plaintext and encrypted video frames is near zero, confirming that video sequences encrypted with EXPer are considerably distorted in a statistical sense, thus providing good confidentiality. The correlation between adjacent pixels within video frames in the different directions (horizontal, vertical and diagonal) for the plaintext and encrypted mobile video is shown in Figure 7. The correlation test is performed by randomly taking N = 6000 pairs of adjacent pixels from the original and selectively encrypted test video frames.
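The correlation test described above can be sketched as follows. This is an illustrative implementation under stated assumptions: frames are 2-D lists of intensities, the helper names `correlation` and `adjacent_pairs` are hypothetical, and the fixed random seed is only there to make the example repeatable.

```python
import math
import random


def correlation(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    vx = sum((a - mx) ** 2 for a in xs)
    vy = sum((b - my) ** 2 for b in ys)
    if vx == 0 or vy == 0:
        return 0.0  # undefined for constant data; 0 is a convention chosen here
    return cov / math.sqrt(vx * vy)


def adjacent_pairs(frame, direction, n_pairs=6000, seed=0):
    """Randomly sample pairs of adjacent pixels in the given direction."""
    dy, dx = {"horizontal": (0, 1),
              "vertical": (1, 0),
              "diagonal": (1, 1)}[direction]
    rng = random.Random(seed)
    h, w = len(frame), len(frame[0])
    xs, ys = [], []
    for _ in range(n_pairs):
        i = rng.randrange(h - dy)
        j = rng.randrange(w - dx)
        xs.append(frame[i][j])
        ys.append(frame[i + dy][j + dx])
    return xs, ys


# A smooth horizontal gradient: horizontally adjacent pixels are fully correlated.
gradient = [[j for j in range(16)] for _ in range(16)]
xs, ys = adjacent_pairs(gradient, "horizontal", n_pairs=200)
r_gradient = correlation(xs, ys)
```

For natural plaintext frames the coefficient of adjacent pixels is typically close to 1, whereas a well-encrypted frame should give a value near zero in all three directions.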

Structural similarity index (SSIM)
The SSIM index [60] is a metric which gauges the structural similarity between original and reconstructed video frames, having a range normally from 0 to 1. Values of SSIM nearer to 0 mean less structural similarity between the plaintext and the reconstructed encrypted bitstream, which means greater distortion has occurred. Values nearer to 1 mean more structural similarity. The SSIM values obtained by applying SE with EXPer to the videos are reported in Figure 8. The SSIM plots make clear that videos are drastically changed when EXPer is applied to the selected combined parameters and that it would be extremely difficult to extrapolate the encrypted parts.
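For orientation, a simplified SSIM computation is sketched below. It evaluates the SSIM expression once over the whole frame, whereas the usual SSIM index averages the same expression over local windows; that simplification, the function name `ssim_global`, and the use of the customary constants K1 = 0.01 and K2 = 0.03 are assumptions of this example rather than details given in the paper.

```python
def ssim_global(x, y, peak=255):
    """Single-window SSIM over two whole 2-D frames (simplified sketch)."""
    xs = [v for row in x for v in row]
    ys = [v for row in y for v in row]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    # Unbiased sample variances and covariance
    vx = sum((a - mx) ** 2 for a in xs) / (n - 1)
    vy = sum((b - my) ** 2 for b in ys) / (n - 1)
    cov = sum((a - mx) * (b - my) for a, b in zip(xs, ys)) / (n - 1)
    # Stabilising constants with the customary K1 = 0.01, K2 = 0.03
    c1 = (0.01 * peak) ** 2
    c2 = (0.03 * peak) ** 2
    return ((2 * mx * my + c1) * (2 * cov + c2)) / \
           ((mx ** 2 + my ** 2 + c1) * (vx + vy + c2))


frame = [[0, 64], [128, 255]]
inverted = [[255 - v for v in row] for row in frame]
same = ssim_global(frame, frame)
different = ssim_global(frame, inverted)
```

Identical frames score 1, and a frame compared with a structurally unrelated or inverted version scores far lower, which is the behaviour the encrypted sequences are expected to show.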

Comparison of EXPer with State-of-the-Art Ciphers
To confirm the value of EXPer, a comparison with two commonly used encryption algorithms, XOR and AES, was performed. XOR can be considered the most suitable encryption algorithm for IoT applications due to its simplicity and lower computational complexity. However, baseline XOR provides limited confidentiality for images, due to potentially high cross-correlations, and, therefore, AES can be utilized to provide greater confidentiality. Though AES is robust against known-plaintext, brute force, and statistical attacks, it incurs a higher encoding and decoding overhead, which is expensive for resource-constrained IoMT devices.
Results were taken with both state-of-the-art ciphers to compare their performance with EXPer. Comparative visual results with XOR, AES-CFB [11], and EXPer with CABAC coding are presented in Figure 9. Figure 9b1-b3, c1-c3 and d1-d3 depict SE of the videos with the three ciphers. The comparative results in Figure 9c vs. d imply that EXPer provides the same level of visual protection and robustness as AES. It is worth mentioning that dQP encryption is not applied in these comparative results, as the AES rounds make the dQP-encrypted video non-format-compliant.

Comparative Visual Quality Analysis
For video quality analysis of these three ciphers, PSNR and SSIM results were also taken. Table 7 is a PSNR comparison between XOR, AES-CFB, and EXPer with combined motion and texture parameter encryption at different QPs. The luminance (Y)-PSNR of the mobile video sequence is 7.18 dB, 6.07 dB, and 6.27 dB after encryption with XOR, AES-CFB and EXPer respectively. Notably, EXPer is able to encrypt an additional syntax element in the implemented SE (Section 4); with that element included, the Y-PSNR of EXPer is 6.00 dB (Table 6 (row 4, column 3)), which is less than the 6.07 dB of AES-CFB for the mobile video in Table 7. The comparative PSNR results confirm that the proposed algorithm produces PSNR values almost equivalent to those of AES.
The SSIM of the encrypted video with combined motion and texture parameters is illustrated in Figure 9. The comparative results show that video sequences encrypted with EXPer and AES-CFB have smaller SSIM values than those encrypted with XOR. Lower SSIM values indicate more content protection. Furthermore, from the evidence of Figure 10, EXPer provides almost the same level of confidentiality as AES-CFB. Hence, the PSNR and SSIM results imply that the encryption applied with EXPer provides confidentiality similar to that of AES.

Comparative Computational Efficiency
In addition to visual content protection, the efficiency of an encryption algorithm for real-time processing depends on the execution/encoding time. Therefore, to evaluate the efficiency of EXPer, a comparison with the encoding time of the standard algorithms XOR and AES-CFB was performed. The comparative results of Figure 10 show that, for the ICE video, the absolute encoding time with EXPer is 91.35 s for motion-only parameter encryption and 87.34 s for texture-only parameter encryption, both of which are less than with AES-CFB. Likewise, the encoding time for combined motion and texture parameter encryption is 89.41 s, which is lower than with AES-CFB. The graphical results of Figure 11 indicate that the absolute encoding time with EXPer encryption is nearly equivalent to that with XOR encryption. Thus, the efficiency of EXPer in terms of execution time is distinctly better than that of AES-CFB. EXPer provides a level of protection similar to that of AES-CFB but with a very small computational overhead.

Comparative Security Analysis
For the security analysis of EXPer, a comparison with the correlation coefficients of the standard algorithms XOR and AES-CFB was performed. Figure 12 shows the comparative correlation coefficients of a plaintext mobile video frame and the encrypted mobile video frame. The results show that the frame encrypted with EXPer has a pixel correlation coefficient of ρ = 0.06905118369984, which is almost equivalent to that of the frame encrypted with AES-CFB, which is 0.068235502062. This result implies that EXPer has achieved the same level of randomness as AES-CFB and outperforms baseline XOR encryption, which has a correlation value of ρ = 0.511745682734. This demonstrates that the proposed EXPer has a greater potential to resist statistical attacks.

Comparative Entropy Analysis
Entropy defines the uncertainty or the chaos level within video frames. It is computed from the gray levels of the pixels and the probability with which each gray level occurs, and so indicates how much information the pixels carry. It is calculated as:

$$H = -\sum_{f} p(f) \log_2 p(f)$$

where f is the gray-level value and p(f) is the probability of f. Figure 13 is a comparative entropy histogram of mobile video frames encrypted with XOR, AES and EXPer. This entropy is evident in the spreading of more black and sharp colors (shown in Figures 5 and 9) across the video frames, compared to the original histogram values (Figure 13a) prior to applying SE with any one of the three ciphers to the selected syntax elements taken from the mobile video sequence.
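The entropy of a frame can be computed directly from its gray-level histogram. The sketch below assumes a grayscale frame stored as a list of rows; the function name `frame_entropy` is chosen for this example only.

```python
import math
from collections import Counter


def frame_entropy(frame):
    """Shannon entropy, in bits per pixel, of the gray levels of a 2-D frame."""
    pixels = [v for row in frame for v in row]
    total = len(pixels)
    counts = Counter(pixels)  # gray-level histogram
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


constant = [[7] * 4 for _ in range(4)]                  # one gray level only
two_level = [[0, 255] * 2 for _ in range(4)]            # two equally likely levels
uniform = [list(range(16 * r, 16 * r + 16)) for r in range(16)]  # all 256 levels once

h_constant = frame_entropy(constant)
h_two_level = frame_entropy(two_level)
h_uniform = frame_entropy(uniform)
```

For 8-bit samples the entropy is bounded by 8 bits per pixel, reached when all 256 gray levels are equally likely, so values close to 8 for an encrypted frame indicate a nearly uniform histogram.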

Cryptanalysis of Proposed ESR-Validated Security Scheme
This section contains a cryptanalysis of relevant attacks upon the SE method of this paper. A prior cryptanalysis of the SE method also appeared in the leading journal paper [55], reviewed by security specialists but aimed at a multimedia audience. Therefore, we now include an additional analysis, including an analysis of further attacks.

Differential Attack
In this type of attack, the attacker tries to guess the keys and sub-keys by investigating the encrypted bitstreams. However, the proposed algorithm is sensitive to any change in the plaintext bitstream, in the secret keys, or in the offset value. A bit-level chaining dependency exists between the sequential permutation and XOR steps in EXPer, which makes it challenging to apply differential attacks. Hence, the proposed algorithm produces a different encrypted output bitstream for the same plaintext input bitstream if any change in the secret key takes place. Thus, whenever an attacker tries random keys through a brute force attack in order to recover the original bitstream, they obtain a more distorted bitstream rather than any clues about the encrypted bitstream and the sub-keys, as demonstrated in Figure 14.

Known-Plaintext and Correlation Attack
EXPer is implemented on video syntax elements rather than on a single image. All tested videos of file size 43.5 (Table 3) comprise 300 frames. Tables 3 and 4 show that the tested mobile video has an overall ESR of 13.26%, based on a total of 6,057,747 encrypted syntax elements that exploit all the important video characteristics, namely motion, texture, and QP variable-length values. This is a large number of syntax elements, distributed across 300 frames. The number of elements and their distribution make it very difficult for attackers to guess the values of those syntax elements correctly so as to render the encrypted video watchable. The syntax elements are independently encrypted and have no correlation with each other. Motion vectors only deal with the motion of the video, transform coefficients only affect the spatial resolution of the video, and the QP values control the bitrate. Therefore, the MVD, TC sign and dQP encryptions have no relation to each other. Consequently, correlation attacks have little hope of success.

Inference Attack
The entropy analysis performed in Section 5.4 implies that, if the videos are selectively encrypted by EXPer, they are highly distorted and it is difficult for an attacker to infer the presence of an object in any one of the R, G, B and luminance domains. The entropy results, previously shown in Figure 13, illustrate that EXPer attains a confidentiality against inference attacks equivalent to that of encryption with AES-CFB.

Discussion and Limitations
A comparison of two recently proposed lightweight ciphers with EXPer appears in Table 2. Encryption is a trade-off between speed of computation and resistance to cryptanalysis. In [48] the authors selected the substitution and permutation principle (refer back to Section 2) for the encryption of static images for an IoT. However, substitution cannot be efficiently computed within the resource-limited devices of an IoMT, specifically for videos, and, therefore, should be used with caution. Likewise, the authors of [41,43] only incorporated shuffling, that is, the permutation principle, in their ciphers. However, those ciphers, based as they are on shuffling alone, are vulnerable to attack [61,62]. Similarly, XOR is considered weak in security terms, although it may be suitable in some circumstances within an IoMT, due to its simple implementation and fast execution. Moreover, a single permutation of the key makes a cipher vulnerable to many attacks, such as known-plaintext attacks. It is for that reason that the proposed EXPer employs XOR and permutations in successive rounds to obtain adequate confidentiality, in the sense that lightweight ciphers must trade off real-time operation against complete invulnerability. Additionally, a 128-bit secret key has been utilized, a length that is considered unbreakable until 2020 is reached (by virtue of the time needed to test every possibility). Table 8 shows the comparative analysis of EXPer against the other two schemes with respect to the visual quality metrics.
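The weakness of baseline XOR against known-plaintext attacks can be made concrete with a short example. This sketch shows only a plain repeating-key XOR, not EXPer; the key, the plaintext bytes and the function name `xor_bytes` are hypothetical values chosen for illustration.

```python
def xor_bytes(data: bytes, key: bytes) -> bytes:
    """XOR every data byte with the key, repeating the key as needed."""
    return bytes(d ^ key[i % len(key)] for i, d in enumerate(data))


key = bytes(range(16))                 # hypothetical 128-bit key
plaintext = b"syntax elements!"        # 16 bytes known to the attacker
ciphertext = xor_bytes(plaintext, key)

# Known-plaintext attack: XOR of ciphertext and plaintext yields the key.
recovered_key = xor_bytes(ciphertext, plaintext)

# With the key recovered, any other block under the same key can be decrypted.
other_cipher = xor_bytes(b"motion vectors..", key)
other_plain = xor_bytes(other_cipher, recovered_key)
```

A single known plaintext block is enough to expose the key of plain XOR, which is why a cipher that also permutes the data over successive rounds is harder to attack in this way.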

Conclusions
Various classes of IoMT devices are utilized for multiple services such as stored video streaming (YouTube, online lectures), live video streaming in the cases of video conferencing, online gaming, and other real-time applications, with the most confidential being interactive video streaming in the form of surveillance applications. It is crucial for modern IoMT nodes to provide data confidentiality in the form of data encryption. The most reliable cipher is AES with 128/192/256-bit keys. However, AES is still not an optimal choice for low-powered surveillance devices with simple hardware. In this paper, keeping in view current security needs, we propose an ESR-validated security scheme for IoMT devices. Within that security scheme, a contribution of the paper is to examine in detail the two alternative entropy coders available for the H.264/AVC codec, namely CAVLC and CABAC, and to determine the ESR when applying SE with the proposed cipher to encrypt the selected syntax elements. As identified herein, using the CABAC entropy coder (see Table 3) considerably reduces the ESR: the maximum ESR is 14.14% for CABAC, whereas it is 31.23% for CAVLC, and the average ESR is 12% for CAVLC and 7.5% for CABAC across the tested reference videos. The ESR calculated for CABAC is acceptable for IoMT applications, as a reduced amount of encrypted data consumes less computation during encoding. In the proposed security scheme, a novel cipher, EXPer, works on basic cryptographic principles, namely permutation and XOR, with three different 128-bit keys over selected syntax elements of the CABAC encoder. EXPer even performs very well on the absolute values of syntax elements, that is, dQP, without changing the bitrate or crashing the decoder (if the decoder is applied to the encrypted video).
Comparative analysis with existing state-of-the-art ciphers shows that EXPer yields confidentiality similar to that of AES-CFB at a computational cost similar to that of XOR, which makes it a suitable choice for protecting real-time video communication in an IoMT setting. Our detailed security analysis revealed that the proposed EXPer is robust against multiple attacks.
Future work based on the ESR measurements taken in this paper can provide a way to model more precisely the trade-offs between computational complexity and memory access in terms of energy consumption within a video sensor device. Both entropy coders have a content dependency, which increases the effect of bit errors. This implies that, in future, error resiliency or channel coding should be built into transmission over an IoMT network, along with a proper key management solution.