1. Introduction
Video streaming is an essential technology with applications across diverse domains, including visual surveillance, traffic control, autonomous navigation, video conferencing, and digital broadcasting [
1]. In many cases, video input is first compressed and then sent to the cloud for further analysis. The approach taken often depends on the target application. For systems focused solely on machine-based video analysis, it is possible to transmit compressed precomputed features instead of the full video, which significantly reduces bandwidth requirements [
2,
Recently, both traditional hand-crafted methods and modern neural network-based techniques have been explored to achieve this [
4,
5]. On the other hand, when human viewing is also required, the system must encode and transmit the original video, thereby increasing the overall complexity.
The rapid development of deep learning has introduced a new trend: developing learned compression frameworks for images and videos using deep neural networks (DNNs). These models aim to surpass traditional coding standards such as JPEG, HEVC, and VVC [
6,
7,
8,
9,
10]. The majority of DNN-based compression techniques have been primarily optimized for human visual perception and not necessarily machine-focused analysis. While deep networks are widely used in visual perception and understanding problems, compression itself has seldom been the exclusive focus of such models [
9].
To bridge the gap, ref. [
11] introduced JPEG AI, a deep-learning-based image coding standard that integrates human viewing and machine vision into a single bitstream. One significant benefit of JPEG AI over its predecessors is that entropy-decoded latent features can be fed directly into DNN-based analytics without requiring full image reconstruction, thereby reducing computational cost. Nonetheless, the standard cannot separate task-related from task-unrelated content at the encoding stage. Therefore, in use cases where human viewing is rare, JPEG AI is still not optimally bit-efficient, because its bitstream contains information that machine vision applications do not need.
In order to go beyond these constraints, new standardization initiatives such as MPEG-VCM (Video Coding for Machines) and MPEG-FCM (Feature Coding for Machines) have been launched. These aim at defining harmonized frameworks catering to both human and machine vision. At the same time, several scalable image coding techniques for multiple tasks have been introduced [
12,
13,
14]. They typically adopt a layered format in which the base layer enables machine-centric analysis (e.g., tracking or object detection), and enhancement layers on top enable reconstruction for human viewing.
In video streaming, internet-delivered services have been supported by digital rights management (DRM) systems that manage copyrights and restrict access in a controlled manner. Encryption is the primary DRM mechanism for preventing unauthorized sharing of copyrighted video streams. End-to-end encryption helps ensure that only legitimate users are able to decrypt and watch the delivered content.
Recently, blockchain-based encryption schemes have been explored to enhance transparency and processing efficiency in secure video streaming. However, most of them suffer performance degradation compared with unencrypted streaming. To address these issues, we present an adaptive scalable human–machine video coding (HMSVC) framework for encrypted video streaming applications. The presented approach decomposes every video into a number of quality layers, as depicted in
Figure 1. Every layer has a specific resolution or degree of fidelity, which can be utilized to adapt transmission to the available bandwidth or the capability of the receiver device. The streaming server dynamically selects and decodes the layers best suited to current network conditions and user requirements. Such a scheme ensures seamless playback and consistent visual quality even under varying network constraints.
The major contributions presented in this paper are summarized as follows:
- (1)
We introduce a new adaptive, scalable encrypted system combining blockchain and region-of-interest (ROI)-based encoding for centralized video streaming. To the best of our knowledge, this is the first adaptation of adaptive blockchain-based scalable streaming for ensuring video integrity in such environments.
- (2)
To address the scalability problems of blockchain with large transactional loads, our design further includes a user-manageable feature for prioritizing critical videos to ensure safe delivery. The system can dynamically scale to meet the requirements of different video applications.
- (3)
The proposed ROI mechanism reduces computational overhead and memory usage in integrity verification, thereby allowing for more rapid operation.
- (4)
We conduct extensive experimental analysis of possible security risks and attacks and show that the proposed integrity verification framework is highly secure and robust.
The remainder of this paper is organized as follows.
Section 2 presents the related works.
Section 3 describes the proposed method.
Section 4 shows the experimental configurations.
Section 5 discusses the experimental results and performance comparisons. Finally,
Section 6 concludes the paper and outlines future research directions.
3. Proposed Method
In this section, the proposed Scalable Human–Machine Coding (SHMC) framework is presented. First, the adaptive layer selection mechanism and the joint optimization problem of human and machine vision are explained. After that, the framework overview and the selective and full encryption strategies are discussed.
3.1. Adaptive Selection and Optimization Formulation
To address the distinct requirements of machine and human vision, video coding is modeled as a two-branch optimization problem. The first branch, the machine branch, minimizes feature distortion under a bitrate constraint. Once this branch is optimized, the human branch is optimized to reconstruct high-quality video subject to its own bitrate constraint.
This can be modeled mathematically as

$$\theta_m^{*} = \arg\min_{\theta_m} D_m(\theta_m) \quad \text{s.t.} \quad R_m \le R_m^{\max},$$

$$\theta_h^{*} = \arg\min_{\theta_h} D_h(\theta_h \mid \theta_m^{*}) \quad \text{s.t.} \quad R_h \le R_h^{\max},$$

where $\theta_m$ and $\theta_h$ denote the coding parameters for the machine and human branches, respectively; $D_m$ and $D_h$ represent the corresponding distortion metrics; and $R_m$, $R_h$, $R_m^{\max}$, and $R_h^{\max}$ are the actual and maximum allowed bitrates for each branch. This formulation explicitly models the dependency of the human branch on the optimized machine parameters, ensuring efficient performance for both machine analysis and human visual reconstruction.
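The two-stage search above can be sketched as a constrained selection over discrete candidate coding parameters. The candidate triples and rate budgets below are hypothetical illustrations, not values from the paper; each stage simply picks the lowest-distortion candidate that fits its rate budget, with the human stage conceptually conditioned on the machine result.

```python
# Two-stage rate-constrained selection sketch (hypothetical values).
# Stage 1: machine branch minimizes feature distortion under R_m^max.
# Stage 2: human branch minimizes reconstruction distortion under R_h^max,
# conditioned (in the real codec) on the optimized machine parameters.

def best_under_budget(candidates, budget):
    """candidates: list of (params, distortion, rate) triples.
    Returns the lowest-distortion entry whose rate fits the budget."""
    feasible = [c for c in candidates if c[2] <= budget]
    return min(feasible, key=lambda c: c[1]) if feasible else None

# Hypothetical (QP-like parameter, distortion, rate-in-bpp) triples.
machine_candidates = [(22, 0.8, 0.30), (27, 1.5, 0.18), (32, 2.9, 0.10)]
theta_m = best_under_budget(machine_candidates, budget=0.20)

# In practice these candidates depend on theta_m, since the human
# layer is coded conditionally on top of the machine layer.
human_candidates = [(22, 0.5, 0.45), (27, 1.1, 0.25), (32, 2.2, 0.12)]
theta_h = best_under_budget(human_candidates, budget=0.30)
```

Tightening either budget pushes the search toward higher-distortion, lower-rate candidates, mirroring the rate-distortion trade-off of the formulation.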
Figure 1.
Overview of the proposed adaptive SVHM framework. The input sequence presented in the figure is selected from the standard HEVC common test sequences [
48].
The proposed framework, shown in
Figure 1, integrates five layers, namely the semantic, structure, texture, selective encryption, and full encryption layers, for efficient and secure video compression and reconstruction. The target application branch is selected through the user interface.
3.2. Framework Overview
As shown in
Figure 1, the resulting SHMC framework consists of five functional layers: semantic layer, structure layer, texture layer, selective encryption layer, and full encryption layer.
These layers cooperate to provide efficient, scalable, and secure compression and reconstruction of the video. The system also features a user interface that allows for dynamic choice of the target branch based on application needs.
3.3. Selection of Interface Layer
Based on the activation components defined in (3), the operating state of each branch is represented using a unified activation vector $\mathbf{a} = [a_1, a_2, a_3, a_4, a_5] \in \{0, 1\}^5$. Each element of this vector corresponds to a specific processing layer, namely the semantic, structural, texture, selective encryption, and full encryption layers. This representation provides a consistent notation and explicitly describes how content characteristics determine the active processing and encryption strategy.
The first three elements of $\mathbf{a}$ indicate the activation of the semantic, structural, and texture layers, while the last two elements specify the applied encryption mode. Depending on the semantic sensitivity of the content, either selective encryption or full encryption is enabled, ensuring that only one encryption strategy is active for a given branch.
Therefore, the active configuration for any branch is specified as Algorithm 1.
The GUI for interface selection enables users to switch adaptively between branches according to the application context and security requirements.
3.4. Semantic Layer
The semantic layer employs a Conditional Semantic Compression (CSC) network, drawing inspiration from the architecture of [
49], to compress high-level semantic features from consecutive frames for machine understanding. Segmentation sequences preserve geometric structure by maintaining object boundaries and spatial relationships between frames. The optimization objective is given in Equation (
1).
The structure layer records motion-based structural information through a motion estimation module. Lower-quality frames are predicted using an Interlayer Frame Prediction (IFP) network, optimized using Equation (
2), while the motion information is propagated efficiently. The texture residuals are further enhanced through a U-Net-based reconstruction module, which preserves visual details by optimizing perceptual quality estimation (PQE).
3.5. Texture Layer
The texture layer focuses on generating high-fidelity video suitable for human vision [
49]. All the compression modules use a channel-wise auto-regressive (CAR) entropy model to compress quantized features into bitstreams.
| Algorithm 1 Branch-based layer selection with adaptive intrusion detection |
- Require: Branch type
- Require: Intrusion flag
- Ensure: Activated layer list
- 1: ⟨initialize the activation vector⟩
- 2: if ⟨intrusion flag is set⟩ then
- 3: ⟨activate the full encryption configuration⟩ ▹ Force full encryption
- 4: else
- 5: if ⟨first branch selected⟩ then
- 6: ⟨activate the corresponding layer set⟩
- 7: else if ⟨second branch selected⟩ then
- 8: ⟨activate the corresponding layer set⟩
- 9: else if ⟨third branch selected⟩ then
- 10: ⟨activate the corresponding layer set⟩
- 11: else if ⟨fourth branch selected⟩ then
- 12: ⟨activate the corresponding layer set⟩
- 13: end if
- 14: end if
- 15: return ⟨activated layer list⟩
|
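A minimal sketch of this branch-based selection logic with the intrusion override follows. The branch names and their activation vectors are illustrative assumptions (the concrete branch configurations are chosen through the GUI in the proposed system); the vector is ordered as [semantic, structure, texture, selective encryption, full encryption].

```python
# Sketch of branch-based layer selection with an intrusion override.
# Branch names and activation vectors are illustrative assumptions;
# vector order: [semantic, structure, texture, selective_enc, full_enc].

BRANCH_LAYERS = {
    "machine": [1, 0, 0, 1, 0],  # analysis only: semantic + selective enc.
    "human":   [1, 1, 1, 1, 0],  # full reconstruction path
}

def select_layers(branch: str, intrusion_flag: bool) -> list:
    if intrusion_flag:
        # Intrusion detected: force the full-encryption configuration.
        return [1, 1, 1, 0, 1]
    return list(BRANCH_LAYERS[branch])

print(select_layers("machine", False))
```

Note that exactly one of the last two elements is set in every configuration, matching the constraint that only one encryption strategy is active per branch.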
When selective encryption is enabled, the system encrypts only the significant regions of each frame. The video stream is partitioned into Coding Tree Units (CTUs), and the encoding process continues with the normal procedures, such as intra/inter prediction, deblocking, Sample Adaptive Offset (SAO), transform, quantization, and entropy coding.
3.6. Selective Encryption Layer
Entropy coding is the final stage of the compression pipeline and losslessly compresses the syntax structures by eliminating their statistical redundancy. In the High Efficiency Video Coding (HEVC) standard, this is done through Context-based Adaptive Binary Arithmetic Coding (CABAC).
CABAC operates in two modes: regular and bypass. In the regular mode, each binary symbol’s probability is determined by a context established from previously encoded symbols. This probability model is updated progressively as encoding continues, and a binary arithmetic encoder generates the compressed bitstream. If encryption is applied in this mode, it may subtly alter the probability model, causing minor variations in the overall bitstream size. Algorithms 2 and 3 show the steps of encryption and decryption in the intrusion scenario.
In bypass mode, bins are assumed to be uniformly distributed and are encoded without context modeling, resulting in faster processing. In contrast to the regular mode, encryption in bypass mode neither affects compression efficiency nor alters the bitstream probability model.
| Algorithm 2 Selective encryption scheme |
- Require: Raw video V, secret key K, ROI map based on semantic layer
- Ensure: Encrypted HEVC/VVC-compliant bitstream
- 1: Extract frames from V
- 2: Detect sensitive objects and generate ROI masks
- 3: for each Coding Tree Unit (CTU) do
- 4: Identify syntax elements associated with ROI
- 5: Select bypass-coded elements (QTC, MVD, optional IPM)
- 6: for each selected syntax element do
- 7: Perform CABAC binarisation to obtain bin sequence
- 8: Obtain current Most Probable Symbol (MPS)
- 9: for each bin b do
- 10: Generate the next keystream bit k from K
- 11: b ← b ⊕ k ▹ XOR encryption of the bin
- 12: Update the MPS state accordingly
- 13: Send b directly to arithmetic encoder
- 14: end for
- 15: if syntax element is non-binary (QTC or MVD) then
- 16: Scramble magnitude using Linear Congruential function
- 17: m ← (a · m + c) mod n ▹ gcd(a, n) = 1
- 18: end if
- 19: end for
- 20: end for
- 21: Assemble arithmetic-coded bins into encrypted bitstream
- 22: return Encrypted bitstream
|
Selective encryption (SE) is applied before the CABAC stage is entered to ensure content protection without compromising format compliance. The processing typically involves three main steps:
Step 1: Encryption Target Selection. The encryption begins with identifying which syntax elements are to be encrypted. These must remain within the HEVC framework to avoid format violations. Most commonly, the sign bits of Quantized Transform Coefficients (QTC) and Motion Vector Differences (MVD), which are coded in bypass mode, are chosen, since encrypting them introduces extensive visual distortion without appreciably compromising compression efficiency or decodability.
For stronger encryption, certain schemes also incorporate Intra Prediction Modes (IPM) coded in the regular mode, which adds marginally to the bitstream size but strengthens the induced distortion. Other methods extend encryption to elements such as luma/chroma IPM, reference frame indices, and SAO parameters, but exclude parameters such as Delta QP to ensure codec stability. In our scheme, encryption is region-of-interest (ROI)-based and targets syntax elements associated with sensitive spatial regions.
Step 2: Encryption Key Generation. After selecting the target elements, a keystream is generated with a cryptographic cipher. As each Coding Tree Unit (CTU) in HEVC is coded independently, stream ciphers such as AES (in CTR or CFB modes), RC6, or chaotic map-based algorithms are well suited. The resulting keystream enables secure and synchronized encryption and decryption at the encoder and decoder.
Step 3: Bitstream Encryption. Finally, the selected syntax elements are encrypted during entropy coding using the derived keystream, typically by XOR operations for binary elements and Linear Congruential (LC) functions for non-binary elements. More sophisticated schemes additionally scramble the magnitudes of MVD and QTC to achieve stronger visual obfuscation without breaking codec compliance.
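The two primitives named in Step 3 can be sketched as follows. The LC parameters below are illustrative, not those of the proposed scheme; invertibility of the magnitude scramble requires gcd(a, n) = 1, so that the decoder holding the same keystream recovers the original values exactly.

```python
# Sketch of the two SE primitives (illustrative parameters):
# XOR for bypass-coded sign bins, and an invertible Linear
# Congruential (LC) scramble for non-binary magnitudes.

def xor_bins(bins, keystream):
    # Encrypt/decrypt binary symbols (e.g., QTC/MVD sign bins).
    # XOR is its own inverse, so the same call decrypts.
    return [b ^ k for b, k in zip(bins, keystream)]

def lc_scramble(value, a=5, c=3, n=256):
    # Scramble a magnitude; requires gcd(a, n) == 1 for invertibility.
    return (a * value + c) % n

def lc_unscramble(value, a=5, c=3, n=256):
    a_inv = pow(a, -1, n)  # modular inverse of a mod n (Python 3.8+)
    return (a_inv * (value - c)) % n

signs = [1, 0, 0, 1]
ks = [1, 1, 0, 0]
enc = xor_bins(signs, ks)
assert xor_bins(enc, ks) == signs      # round trip via XOR
assert lc_unscramble(lc_scramble(42)) == 42  # round trip via LC
```

Because both operations are bijections on their domains, the bitstream stays decodable by any receiver that shares the key, which is what preserves format compliance.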
| Algorithm 3 Selective decryption scheme |
- Require: Encrypted bitstream, secret key K, ROI map
- Ensure: Reconstructed video
- 1: Initialise CABAC decoder
- 2: Regenerate ROI masks
- 3: for each Coding Tree Unit (CTU) do
- 4: Identify encrypted syntax elements
- 5: for each encrypted syntax element do
- 6: Obtain current Most Probable Symbol (MPS)
- 7: for each encrypted bin b do
- 8: Generate the next keystream bit k from K
- 9: b ← b ⊕ k ▹ XOR decryption of the bin
- 10: Update the MPS state accordingly
- 11: Send b to arithmetic decoder
- 12: end for
- 13: if syntax element is non-binary (QTC or MVD) then
- 14: Apply inverse Linear Congruential function
- 15: m ← a⁻¹ · (m − c) mod n
- 16: end if
- 17: end for
- 18: end for
- 19: Perform standard inverse transform and prediction
- 20: return Reconstructed video
|
After these operations, the encrypted bitstream is stored or transmitted securely. At decoding time, a Homomorphic Decryption Decoder (HDD) regenerates the key stream, decrypts the syntax elements, and performs the regular decoding processes, such as inverse transform and prediction, to exactly reconstruct the original video.
Figure 2 illustrates the main pipeline of the CABAC encryption implementation.
Format Compliance: The most important advantage of selective encryption (SE) is that it protects video data efficiently without breaking compatibility with the HEVC standard. An encrypted video remains decodable by a standard HEVC decoder because SE leaves the format of the encoded bitstream unchanged. This property makes SE very flexible and easy to combine with other content protection schemes such as watermarking or information hiding.
Low Bit Rate Impact: If encryption is applied to syntax elements in bypass mode, there is no measurable effect on the bit rate. Even when encrypting regular-mode elements, for instance the Quantized Transform Coefficients (QTC), the bit rate increase is generally less than 0.1%, which is negligible in most practical applications.
Visual Protection: SE effectively obscures important visual data through selectively encrypting features such as QTC, Motion Vector Differences (MVD), and Intra Prediction Modes (IPM). This distortion obscures important texture and motion information, providing an effective layer of privacy protection for sensitive visual data.
Efficiency and High Speed: Since the encryption is performed before the CABAC process and involves lightweight stream ciphers, SE achieves high processing speed. The additional computational overhead is negligible (around 1.5–2%), making the method suitable for real-time systems as well as large-scale video applications.
To quantitatively analyze the interaction between selective encryption (SE) and CABAC, we evaluated the bitrate variation introduced by encrypting different classes of syntax elements under both regular and bypass coding modes. Since CABAC employs adaptive probability modeling in regular mode, encryption applied to these elements may slightly affect context adaptation, whereas bypass-coded elements are encoded assuming uniform probability and are therefore insensitive to encryption.
In the proposed framework, encryption is primarily applied to bypass-coded syntax elements such as Quantized Transform Coefficients (QTC) and Motion Vector Differences (MVD). As a result, the CABAC probability model remains unchanged, and no observable impact on compression efficiency is introduced. When encryption is extended to a limited set of regular-mode elements, such as selected Intra Prediction Modes (IPM), only minor perturbations in context modeling are observed, leading to a negligible bitrate increase.
Table 1 reports the bitrate overhead introduced by the proposed SE scheme under Random Access (RA), Low Delay (LD), and All Intra (AI) configurations. Across all tested sequence classes and coding scenarios, the bitrate increase consistently remains below 0.1%. These results confirm that the proposed selective encryption strategy introduces negligible impact on compression efficiency while preserving CABAC compatibility.
3.7. Choice of Encryption Algorithm
AES in counter (CTR) mode is employed in the proposed framework due to its low latency, stream-oriented operation, and suitability for real-time video coding. AES-CTR enables encryption through simple XOR operations and exhibits predictable computational complexity. In addition, AES benefits from widespread hardware support and efficient implementations on both general-purpose processors and embedded platforms.
In contrast, chaotic map-based encryption schemes typically rely on iterative and floating point computations, which introduce higher latency and complicate hardware implementation. These characteristics limit their applicability in real-time CABAC-based video compression pipelines. Therefore, AES-CTR provides a practical balance between security, efficiency, and hardware feasibility. Overall, the proposed selective encryption framework ensures effective content protection while maintaining CABAC coding efficiency, format compliance, and low computational overhead.
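The counter-mode mechanics behind this choice can be sketched as follows. Python's standard library provides no AES, so a keyed SHA-256 hash stands in for the block cipher here purely to illustrate how a counter-derived keystream is XORed with the data; a real deployment would use AES-CTR from a cryptographic library, and the function names below are illustrative.

```python
# CTR-mode sketch: a counter is fed through a keyed pseudorandom
# function to produce a keystream, which is XORed with the data.
# SHA-256 is a stand-in for the AES block cipher (illustration only).
import hashlib

def ctr_keystream(key: bytes, nonce: bytes, nbytes: int) -> bytes:
    out = b""
    counter = 0
    while len(out) < nbytes:
        block = hashlib.sha256(
            key + nonce + counter.to_bytes(8, "big")
        ).digest()
        out += block
        counter += 1
    return out[:nbytes]

def ctr_xor(key: bytes, nonce: bytes, data: bytes) -> bytes:
    # Encryption and decryption are the same XOR operation.
    ks = ctr_keystream(key, nonce, len(data))
    return bytes(d ^ k for d, k in zip(data, ks))

key, nonce = b"\x01" * 16, b"\x02" * 8
ct = ctr_xor(key, nonce, b"sensitive syntax elements")
assert ctr_xor(key, nonce, ct) == b"sensitive syntax elements"
```

Because each keystream block depends only on the key, nonce, and counter value, blocks can be precomputed or generated in parallel, which is the property that gives CTR mode its low latency.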
As shown in
Table 2, the proposed method substantially reduces the time needed for privacy-sensitive pixel extraction compared with DeepSVC [
49] across all tested sequences. This efficiency improvement is especially notable for high-motion content such as Football and Surfing, where the proposed approach processes regions in a fraction of the time required by DeepSVC. In our framework, sensitive regions within video frames are automatically identified using object detection models, enabling dynamic recognition and tracking of privacy-sensitive content such as faces or license plates. This automated approach eliminates the need for manual intervention and ensures that selective encryption adapts to scene changes and moving objects.
By encrypting only the detected regions, computational overhead is significantly reduced compared with full-frame encryption, while preserving visual privacy. The system is also designed to manage occlusions and multiple moving targets, maintaining consistent privacy protection in complex or dynamic environments.
3.8. Fully Encrypted Layer
For scenarios that demand increased security, such as video surveillance or other sensitive applications, the fully encrypted layer is activated. The video stream is first divided into three-second segments. A hash and a Message Authentication Code (MAC) are created for each segment using the HMAC algorithm, so that each segment can be independently verified for integrity and authenticity throughout transmission or storage.
To verify the video segments, a SHA-256 hash is first computed for each segment, and a Message Authentication Code (MAC) is then generated using the HMAC algorithm with a randomly chosen key. Each segment’s HMAC key is kept confidential and intact. At the block key generation stage, a block-level HMAC key is calculated by performing randomized hashing on the segment’s HMAC key. Results from both the data encryption and key generation processes are then cached in the blockchain and buffered. Each blockchain block links to earlier and later blocks and contains information such as the video segment’s path, the segment’s Video Integrity Code (VIC), an ephemeral ECC public key, a timestamp, and the HMAC of the previous block.
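A minimal sketch of this per-segment integrity metadata follows, using the standard hashlib/hmac modules; the block field names are illustrative, and the ephemeral ECC key and timestamp fields are omitted for brevity.

```python
# Sketch: per-segment SHA-256 hash and HMAC (the Video Integrity
# Code), with each block storing the previous block's code to form
# a chain. Field names are illustrative.
import hashlib
import hmac
import os

def make_block(segment_bytes, segment_path, prev_vic, key):
    return {
        "path": segment_path,
        "segment_hash": hashlib.sha256(segment_bytes).hexdigest(),
        "vic": hmac.new(key, segment_bytes, hashlib.sha256).hexdigest(),
        "prev_vic": prev_vic,  # link to the previous block
    }

key = os.urandom(32)
chain, prev = [], "0" * 64  # genesis link
for i, seg in enumerate([b"segment-0", b"segment-1"]):
    block = make_block(seg, f"video/seg_{i}.bin", prev, key)
    chain.append(block)
    prev = block["vic"]

# A verifier holding the key recomputes the HMAC; any segment
# tampering or chain reordering breaks one of these checks.
assert hmac.new(key, b"segment-0", hashlib.sha256).hexdigest() == chain[0]["vic"]
assert chain[1]["prev_vic"] == chain[0]["vic"]
```

Since one HMAC is computed per segment rather than per frame, the verification cost grows with the number of segments only, consistent with the overhead analysis in Section 3.9.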
The main encryption process uses elliptic curve cryptography (ECC) to encrypt the HMAC key and MAC of the video to offer additional security and protection from illegal tampering as shown in
Figure 3. The resulting process is referred to as the Fully Encrypted Video Integrity Code (FVIC) in the proposed encoder. In elliptic curve cryptography (ECC), mathematical operations are executed over finite fields, which may be categorized as either prime or binary fields. Such fields are denoted by GF$(q)$, where $q = p^m$, with $p$ representing a prime number and $m$ a positive integer.
When $m = 1$ and $p$ is an odd prime, the field is referred to as a prime finite field, written as GF$(p)$. On the other hand, if $p = 2$ and $m \ge 1$, the structure forms a binary finite field, or equivalently, a characteristic-two field, expressed as GF$(2^m)$.
For fields of characteristic neither two nor three, the elliptic curve can be represented in the reduced Weierstrass form:
$$y^2 = x^3 + ax + b.$$
The discriminant
$$\Delta = -16(4a^3 + 27b^2)$$
has to differ from 0 in order to define a valid curve. Over a finite field GF$(p)$, the elliptic curve is given by
$$y^2 \equiv x^3 + ax + b \pmod{p},$$
where $\equiv$ represents modular equivalence.
Every elliptic curve is associated with the domain parameters $(p, a, b, G, n, h)$, where $G$ is the generator point, $n$ its order, and $h$ the cofactor.
ECC Operations and Secure Communication
Figure 4 shows the main ECC operations: point addition, subtraction, and doubling. For secure communication between two parties, $A$ and $B$, each selects a common curve and generator point $G$. Their private keys, $d_A$ and $d_B$, lead to public keys $Q_A = d_A G$ and $Q_B = d_B G$, respectively.
For message $M$, sender $A$ performs encryption with $B$’s public key:
$$C = (C_1, C_2) = (kG,\; M + kQ_B),$$
where $k$ is randomly chosen for each encryption. $B$ recovers the original message by computing the following:
$$M = C_2 - d_B C_1.$$
Since a new random $k$ is selected with each encryption, even identical messages yield different ciphertexts, which increases the security of the scheme against replay and pattern attacks.
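These encryption and decryption equations can be checked on a toy curve. The sketch below uses the textbook curve y² = x³ + 2x + 2 over GF(17) with generator G = (5, 1) of order 19, which is far too small for real security and unrelated to the parameters used in the proposed system; it verifies that C₂ − d_B·C₁ recovers M.

```python
# Toy EC-ElGamal demo on y^2 = x^3 + 2x + 2 over GF(17).
# Illustration only: real systems use standardized curves.
P, A = 17, 2
O = None  # point at infinity

def inv(x):
    return pow(x, P - 2, P)  # Fermat inverse in GF(P)

def add(p, q):
    if p is O:
        return q
    if q is O:
        return p
    (x1, y1), (x2, y2) = p, q
    if x1 == x2 and (y1 + y2) % P == 0:
        return O  # p + (-p) = O
    if p == q:
        lam = (3 * x1 * x1 + A) * inv(2 * y1) % P  # tangent slope
    else:
        lam = (y2 - y1) * inv(x2 - x1) % P         # chord slope
    x3 = (lam * lam - x1 - x2) % P
    return (x3, (lam * (x1 - x3) - y1) % P)

def mul(k, p):
    r = O  # double-and-add scalar multiplication
    while k:
        if k & 1:
            r = add(r, p)
        p = add(p, p)
        k >>= 1
    return r

def neg(p):
    return O if p is O else (p[0], (-p[1]) % P)

G = (5, 1)            # generator of order 19
d_B = 7               # B's private key
Q_B = mul(d_B, G)     # B's public key: Q_B = d_B * G

M = mul(3, G)         # message encoded as a curve point
k = 11                # fresh per-encryption randomness
C1, C2 = mul(k, G), add(M, mul(k, Q_B))
assert add(C2, neg(mul(d_B, C1))) == M  # C2 - d_B*C1 = M
```

The final assertion expands to M + kQ_B − d_B(kG) = M, which holds precisely because Q_B = d_B G; this is the algebraic identity the decryption equation relies on.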
ECIES Integration
For further security of keys, the suggested model adopts the Elliptic Curve Integrated Encryption Scheme (ECIES) [
50], which integrates ECC-based public key cryptography with symmetric encryption for efficiency considerations. The key relationship is given as
$$Q = d \cdot G,$$
where $d$ and $Q$ are the private key and the public key, respectively. For every encryption session, a temporary (ephemeral) key pair is generated as follows:
$$(k_e,\; R_e = k_e \cdot G).$$
This renders every encryption session unique, immune to interception, and secure in case of compromised previous session keys.
3.9. Computational Overhead of Blockchain-Based Verification
The fully encrypted layer incorporates blockchain-based verification using ephemeral elliptic curve cryptography (ECC) keys and HMAC authentication at the segment level. To evaluate the computational overhead introduced by this design, we analyze the additional processing cost incurred during encryption, verification, and playback, and compare it with conventional digital rights management (DRM) mechanisms.
In the proposed framework, ephemeral ECC key generation and HMAC computation are performed once per video segment rather than per frame. As a result, the cryptographic overhead scales linearly with the number of segments and remains independent of the video resolution or frame rate. Compared to traditional DRM systems that rely on persistent key exchanges or license server interactions, the proposed approach avoids repeated handshake operations during playback.
Experimental evaluation as presented in
Table 3 shows that the additional processing time introduced by ECC key derivation and HMAC verification accounts for a small fraction of the overall encoding and decoding time. In practice, the average overhead remains within a few milliseconds per segment, which is negligible relative to segment duration and does not affect real-time decoding performance. These results indicate that the proposed verification mechanism achieves improved security with acceptable computational cost.
4. Experiment Configurations
To comprehensively evaluate the proposed framework, experiments are conducted on datasets representing diverse application domains. Surveillance-oriented sequences include static and dynamic scenes with human activities and occlusions, enabling evaluation of privacy preservation and visual distortion. Video conferencing sequences emphasize facial regions and low-latency requirements, while autonomous driving sequences contain complex motion, object interactions, and machine vision tasks such as object detection and tracking.
All datasets are encoded using common test conditions with resolutions ranging from 720p to 1080p and frame rates between 30 and 60 fps, ensuring fair and reproducible comparisons across scenarios.
We evaluate our proposed end-to-end method on two machine vision tasks: video object detection and action recognition. For object detection, experiments are conducted on the ImageNet VID dataset, which includes 3862 training video snippets and 555 validation videos. For action recognition, we evaluate on UCF-101 [
51], where videos are organized into 25 groups, each containing 4–7 clips of a specific action; clips within the same group may share features like similar backgrounds, viewpoints, or action categories. Additionally, we use HMDB-51, consisting of 6766 clips spanning 51 action categories. Our training and evaluation procedures follow the standard protocols established by MMTracking and MMAction2.
To evaluate the effectiveness of our framework for human viewing, the structure and texture modules are trained on the Vimeo-90k dataset. Video reconstruction performance is then assessed using the HEVC common test conditions, covering different classes using sequences corresponding to UrbanScenes, CityHighway, SuburbStreets, CampusWalk, ParkTrails, MountainStreams, MixedScenesAvg, and IndoorMotion. For tasks involving both object detection and frame reconstruction, datasets containing raw (uncompressed) videos along with corresponding object annotations are required. Currently, only two datasets meet these requirements: SFU HW Objects v1 [
52] and TVD [
53], both of which have been used within the MPEG-VCM standardization efforts [
54]. The SFU HW Objects v1 dataset consists of raw YUV420 video sequences and has previously played a role in the development of video coding standards, including HEVC [
16] and VVC [
18].
The proposed SHMC framework was implemented on an NVIDIA RTX 2080Ti GPU. Experimental results show that our approach surpasses existing state-of-the-art methods in detecting a variety of tampering types, including copy, move, insertion, and deletion attacks, while maintaining high accuracy and robustness in video integrity verification.
5. Evaluation Metrics
Following the approach adopted in many recent studies [
1,
35,
37,
43,
55], we evaluate the performance of the proposed approach using the following criteria: for video reconstruction, we report the bitrate in bits per pixel (bpp) and assess quality through RGB peak signal-to-noise ratio (PSNR) and RGB multi-scale structural similarity (MS-SSIM) [
43]. For object detection, we use mean average precision (mAP), a common benchmark. To calculate mAP, the average precision (AP) is determined for each class as the area under its precision–recall curve. Precision measures the proportion of detections that are correct, whereas recall represents the proportion of true objects successfully identified. The final mAP value is obtained by averaging the APs across all classes, producing a single metric that reflects overall detection performance.
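The AP computation just described can be sketched as follows, using simple all-point summation of precision–recall rectangles; the detection list and ground-truth count are toy values, and practical evaluators additionally apply IoU matching and interpolation variants.

```python
# Sketch: average precision (AP) as the area under the
# precision-recall curve for one class; mAP averages per-class APs.

def average_precision(scored_hits, num_gt):
    """scored_hits: list of (confidence, is_true_positive) pairs.
    num_gt: number of ground-truth objects for this class."""
    hits = sorted(scored_hits, key=lambda h: -h[0])  # by confidence
    tp = fp = 0
    ap, prev_recall = 0.0, 0.0
    for _, is_tp in hits:
        tp += is_tp
        fp += not is_tp
        precision = tp / (tp + fp)
        recall = tp / num_gt
        ap += precision * (recall - prev_recall)  # rectangle area
        prev_recall = recall
    return ap

# Toy detections for a single class (confidence, correct?).
dets = [(0.9, True), (0.8, True), (0.7, False), (0.6, True)]
ap = average_precision(dets, num_gt=4)  # 0.6875 for these values
m_ap = sum([ap]) / 1  # mean over the (single) class here
```

Each false positive lowers precision without advancing recall, so it contributes no area, which is why AP rewards ranking correct detections ahead of incorrect ones.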
To evaluate the effectiveness of the proposed adaptive intrusion detection system, we conducted BD-rate reduction experiments under various intrusion scenarios. Five different types of intrusions were simulated on the user interface under two baseline conditions: without selective intrusion and with full intrusion. The adaptive system was then applied, dynamically adjusting its operation based on the application context and the current intrusion situation.
The results show that the adaptive approach substantially reduces BD-rate compared to both baseline conditions, demonstrating enhanced efficiency and responsiveness in handling intrusions. This improvement underscores the benefits of context-aware adaptation in minimizing the impact of intrusions on system performance, as illustrated in
Figure 5.
5.1. Human Vision Application Selection
Recent work on video understanding and multi-modal analysis has explored the fusion of visual and auxiliary features to improve machine interpretation accuracy. In addition, integration of system components in secure multimedia pipelines has been explored in broader engineering contexts [
32,
56].
Figure 6 illustrates the BD-rate reduction results for SHMC, evaluated on both texture and structure layers in terms of rate-distortion (RD) performance. The PSNR values along with the corresponding bits per pixel (bpp) for different methods are plotted over the HEVC common test sequences. It is clear that SHMC consistently achieves significant BD-rate reductions compared to other methods, demonstrating improved compression efficiency while maintaining high visual quality.
These gains are observed in both the texture and structure layers, underscoring the framework’s ability to preserve fine details as well as structural information. Overall, the RD curves indicate that SHMC strikes an effective balance between bitrate and reconstruction quality, thus making it a reliable and cost effective technique for scalable video coding applications.
Figure 7 shows the MS-SSIM performance of SHMC on both the structure and texture layers, presented through rate distortion (RD) curves. The plots show MS-SSIM values along with the corresponding bits per pixel (bpp) for various methods applied to the HEVC common test sequences.
The results indicate that SHMC consistently outperforms other approaches, achieving higher MS-SSIM values at comparable or even lower bitrates. This demonstrates the ability of the proposed approach to preserve perceptual quality more efficiently. Performance improvements are evident across both the texture and structure layers, ensuring that SHMC maintains both visual fidelity and structural detail. Overall, the RD curves highlight the effectiveness of SHMC in balancing compression efficiency with high-quality visual reconstruction in scalable video coding scenarios.
As shown in
Figure 8, the visual impact of encrypting regions of interest (ROI) with the proposed selective encryption can be clearly observed on the BasketballPass sequence.
5.2. Machine Analysis Applications
The BD-rate performance of the proposed method in comparison with conventional and learning-based codecs across multiple video collections under the random-access configuration is presented in
Table 4. The proposed method consistently achieves the largest BD-rate reductions across all datasets, indicating superior compression efficiency relative to both traditional codecs (VP9 [
57], HEVC [
16], and VVC [
18]) and existing learning-based approaches. The performance of the proposed SHMC in video action recognition is further evaluated on the UCF101 and HMDB51 datasets.
Table 5 summarizes the performance of various video coding methods, including our proposed method, across HEVC Classes B, C, and D. The results are reported in terms of BDBR, computed from both PSNR versus bpp and MS-SSIM versus bpp curves.
Our proposed method consistently achieves superior compression efficiency, with lower average BDBR values of −30.76% (PSNR) and −60.29% (MS-SSIM) across all classes. For Class B, our approach reduces the bit rate while maintaining high fidelity, outperforming most prior methods. Similarly, in Class C and Class D sequences, the proposed method demonstrates significant gains, particularly in challenging low-resolution Class D videos, indicating robustness across different content types. Overall, the results confirm that our method not only improves rate-distortion performance over conventional codecs such as HEVC and VVC but also surpasses state-of-the-art learned video compression techniques like DeepSVC.
Table 6 presents Top-1 and Top-5 accuracy (%) along with the corresponding bits per pixel (bpp) for the different methods. Our proposed method consistently achieves the highest Top-1 accuracy, reaching 79.51% on UCF101 and 42.97% on HMDB51, while maintaining the lowest bpp among the compared methods. This demonstrates that SHMC provides a superior semantic-layer representation, delivering higher action recognition accuracy with greater compression efficiency.
5.3. Selective Encryption Applications
In this section, we present the experimental evaluation of the proposed selective encryption scheme for video streaming. The experiments are designed to assess both the effectiveness of the encryption in protecting sensitive regions and the visual quality of the unencrypted areas. We focus on selective encryption of regions of interest (ROI) in video sequences, demonstrating how our approach securely conceals critical content while preserving the perceptual integrity of the surrounding scene. Both subjective visual results, illustrated on the BasketballPass sequence, and objective metrics are provided to demonstrate the advantages of our approach in terms of security, efficiency, and visual fidelity.
Figure 8 illustrates the subjective visual results of the BasketballPass sequence under selective ROI encryption. This selective encryption preserves the overall scene context, ensuring that non-sensitive areas remain perceptible, while the targeted regions are securely protected. The results clearly demonstrate that our approach achieves high visual security in the ROI without introducing noticeable artifacts or degrading the quality of the unencrypted regions, confirming the effectiveness and practicality of the proposed selective encryption scheme.
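The core mechanics of ROI-only protection can be sketched in a few lines. The snippet below is only an illustration of the idea, not the cipher used in our framework: it XORs the ROI pixels with a SHA-256 counter-mode keystream as a stand-in for a real stream cipher, and the frame dimensions, ROI box, and key are all illustrative.

```python
import hashlib
import numpy as np

def _keystream(key: bytes, nbytes: int) -> np.ndarray:
    """SHA-256 in counter mode as a toy keystream (stand-in for a real cipher)."""
    out = bytearray()
    ctr = 0
    while len(out) < nbytes:
        out += hashlib.sha256(key + ctr.to_bytes(8, "big")).digest()
        ctr += 1
    return np.frombuffer(bytes(out[:nbytes]), dtype=np.uint8)

def xor_roi(frame: np.ndarray, roi, key: bytes) -> np.ndarray:
    """XOR the ROI pixels with the keystream; applying it twice decrypts."""
    y0, y1, x0, x1 = roi
    out = frame.copy()
    region = out[y0:y1, x0:x1]
    ks = _keystream(key, region.size).reshape(region.shape)
    out[y0:y1, x0:x1] = region ^ ks
    return out

# Illustrative QCIF-sized frame and ROI box (y0, y1, x0, x1).
frame = np.random.default_rng(0).integers(0, 256, (144, 176, 3), dtype=np.uint8)
roi, key = (40, 100, 60, 140), b"session-key"
enc = xor_roi(frame, roi, key)
dec = xor_roi(enc, roi, key)
assert np.array_equal(dec, frame)            # lossless recovery after decryption
assert np.array_equal(enc[:40], frame[:40])  # background rows left untouched
```

Because only the ROI bytes are touched, the surrounding scene remains perceptible exactly as in the subjective results above, while the protected region is statistically scrambled until decryption.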
5.4. Intrusion Scenarios
A quantitative evaluation of the impact of intrusion scenarios on video quality under compression was performed, focusing on their combined effect on both human visual perception and machine-oriented analysis. The performance of the intrusion detection system is assessed by comparing rate-distortion behavior before and after intrusion across multiple compression settings, using PSNR as an objective quality metric referenced to the original clean frames. The five selected scenarios are as follows.
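The PSNR values referenced to the clean frames are computed from the mean squared error in the usual way; a minimal sketch with an illustrative uniform-error frame:

```python
import numpy as np

def psnr(ref: np.ndarray, test: np.ndarray, peak: float = 255.0) -> float:
    """PSNR in dB between a clean reference frame and a degraded frame."""
    mse = np.mean((ref.astype(np.float64) - test.astype(np.float64)) ** 2)
    return float("inf") if mse == 0 else 10.0 * np.log10(peak ** 2 / mse)

ref = np.zeros((8, 8), dtype=np.uint8)
degraded = np.full((8, 8), 16, dtype=np.uint8)   # uniform error of 16 levels
print(round(psnr(ref, degraded), 2))             # 24.05
```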
5.4.1. Brightness Shift Intrusion
Brightness shift intrusion refers to the intentional manipulation of video frame luminance, causing frames to appear significantly brighter or darker than their natural exposure. This type of attack alters the overall visibility of a scene, often leading to washed-out highlights or obscured shadow regions. From a security perspective, excessive brightness can hide critical details such as facial features or license plate numbers, while reduced brightness can conceal objects entirely. For machine vision systems, brightness distortion disrupts pixel intensity distributions, leading to unstable feature extraction and reduced confidence in object detection or tracking models. In practice, this attack may occur through malicious control of camera gain settings or deliberate exposure tampering, especially in outdoor surveillance systems where lighting conditions are assumed to be trustworthy.
5.4.2. Contrast Distortion Intrusion
Contrast distortion intrusion involves modifying the difference between light and dark regions in a frame, either flattening the visual structure or exaggerating intensity variations. When contrast is reduced, object boundaries become less distinguishable, making it difficult for both humans and algorithms to separate foreground objects from the background. Conversely, excessive contrast can saturate details and introduce artificial edges that confuse machine perception. This type of intrusion is particularly dangerous in automated monitoring systems, as many vision models rely on contrast-driven gradients to detect shapes and motion. A realistic attack scenario includes tampering with camera processing pipelines or injecting altered video streams that subtly reduce contrast, thereby degrading detection accuracy without raising immediate suspicion.
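Contrast distortion can be modeled as scaling intensities about mid-gray, with the scale factor below chosen purely for illustration (alpha < 1 flattens, alpha > 1 saturates):

```python
import numpy as np

def contrast_scale(frame: np.ndarray, alpha: float) -> np.ndarray:
    """Scale contrast about mid-gray (128), clipping to the valid range."""
    out = (frame.astype(np.float32) - 128.0) * alpha + 128.0
    return np.clip(out, 0, 255).astype(np.uint8)

frame = np.array([[28, 128, 228]], dtype=np.uint8)
print(contrast_scale(frame, 0.5))   # [[ 78 128 178]] -- boundaries flatten
```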
5.4.3. Color Cast Manipulation Intrusion
Color cast manipulation intrusion refers to the deliberate alteration of color balance by disproportionately amplifying or suppressing specific color channels. This results in an unnatural color tint across the entire scene, such as a strong blue, green, or yellow bias. For human observers, color cast distortion affects visual realism and can obscure important cues like skin tone or object material. For machine learning models, which often rely on color consistency for classification and recognition, this intrusion can significantly degrade performance. Such attacks are particularly harmful in systems trained under normal lighting assumptions. In real world scenarios, color cast manipulation may be introduced through white balance tampering, colored light sources, or adversarial interference in camera firmware.
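A color cast of this kind reduces to per-channel gain scaling; the gain triple below (a blue bias) is illustrative:

```python
import numpy as np

def color_cast(frame: np.ndarray, gains) -> np.ndarray:
    """Scale the RGB channels by per-channel gains, clipping to [0, 255]."""
    out = frame.astype(np.float32) * np.asarray(gains, dtype=np.float32)
    return np.clip(out, 0, 255).astype(np.uint8)

pixel = np.array([[[100, 100, 100]]], dtype=np.uint8)   # neutral gray
print(color_cast(pixel, (1.0, 1.0, 1.4)))               # [[[100 100 140]]]
```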
5.4.4. Gamma Distortion Intrusion
Gamma distortion intrusion targets the nonlinear relationship between pixel intensity and perceived brightness, reshaping how mid tone values are represented. This attack does not simply make the frame brighter or darker but redistributes intensity levels in a way that suppresses critical structural information. As a result, important visual details may appear flattened or overly enhanced, misleading human interpretation. For machine vision systems, gamma distortion disrupts learned feature hierarchies, particularly in convolutional neural networks that depend on consistent intensity patterns. A plausible attack scenario involves manipulating camera gamma correction parameters or injecting pre processed video streams that intentionally distort tonal responses, thereby reducing recognition reliability while maintaining plausible visual quality.
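The nonlinear remapping is a power curve on normalized intensities; a sketch with an illustrative gamma of 2.0, which leaves the extremes fixed while crushing mid-tones:

```python
import numpy as np

def gamma_distort(frame: np.ndarray, gamma: float) -> np.ndarray:
    """Remap intensities through a power curve; gamma > 1 darkens mid-tones."""
    norm = frame.astype(np.float32) / 255.0
    return np.clip(255.0 * norm ** gamma, 0, 255).astype(np.uint8)

frame = np.array([[0, 128, 255]], dtype=np.uint8)
print(gamma_distort(frame, 2.0))   # mid-gray drops to ~64; extremes unchanged
```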
5.4.5. Sensor Noise Injection Intrusion
Sensor noise injection intrusion simulates electronic interference or low-quality sensor behavior by introducing random intensity fluctuations into video frames. This form of attack degrades image clarity by corrupting fine details and edges, which are essential for both human interpretation and automated analysis. While humans may perceive the scene as grainy or unstable, machine learning models are especially vulnerable, as noise disrupts gradient consistency and increases false detections. Sensor noise injection can occur through electromagnetic interference, low-light manipulation, or deliberate signal corruption in the acquisition pipeline. This intrusion is particularly dangerous because it can degrade system performance gradually, making detection difficult while continuously reducing the analytical accuracy.
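The random intensity fluctuations can be simulated with additive zero-mean Gaussian noise followed by clipping; the noise level and seed below are illustrative:

```python
import numpy as np

def inject_noise(frame: np.ndarray, sigma: float, seed: int = 0) -> np.ndarray:
    """Add zero-mean Gaussian noise of standard deviation sigma, then clip."""
    rng = np.random.default_rng(seed)
    noisy = frame.astype(np.float32) + rng.normal(0.0, sigma, frame.shape)
    return np.clip(noisy, 0, 255).astype(np.uint8)

frame = np.full((64, 64), 128, dtype=np.uint8)   # flat gray test frame
noisy = inject_noise(frame, sigma=10.0)
print(noisy.shape == frame.shape, noisy.dtype)   # shape preserved, still uint8
```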
Figure 5 illustrates a visual comparison of frames captured before and after an intrusion event in the intrusion detection scenario, demonstrating that the proposed system successfully detects the intrusion.
5.5. Full Encryption Applications
The experimental results demonstrate the effectiveness of the proposed full encryption scheme in real-time video streaming applications.
Table 7 presents a detailed comparison of the proposed method against several state-of-the-art approaches in terms of visual quality and computational performance. Metrics such as PSNR, MSE, BER, VQM, SSIM, and VMAF evaluate the reconstruction quality, while encryption and decryption times assess computational efficiency. The proposed method consistently achieves the highest PSNR (42.9 dB), the lowest MSE, and the top SSIM (0.93), indicating superior fidelity. Additionally, it maintains a very low BER and high VQM and VMAF scores, demonstrating robustness and perceptual quality. In terms of efficiency, the proposed approach offers competitive encryption and decryption times compared to other methods, highlighting a favorable balance between performance and speed. Overall, the results confirm that the proposed framework outperforms existing techniques across both quality and computational metrics.
In
Figure 9 and
Figure 10, the results on the test sequences demonstrate the comparative performance of VVC, HEVC, Deep Scalable Video Coding (DSVC), and our proposed method in terms of PSNR and bitrate. Among the reference codecs, VVC achieves the highest coding efficiency, delivering superior PSNR at significantly lower bitrates than HEVC, which requires higher bitrates to reach similar visual quality. DSVC provides a scalable coding structure, offering competitive PSNR while enabling layer-wise decoding for adaptive streaming scenarios.
Our proposed method further improves on these results through multiframe enhancement, achieving PSNR comparable to or slightly higher than VVC while reducing the bitrate even further, highlighting its ability to maintain high visual quality at reduced bandwidth. These results confirm the effectiveness of our approach in enhancing compression efficiency and visual fidelity, making it particularly suitable for applications requiring high-quality, low-bitrate video streaming or selective layer decoding.
6. Conclusions
Secure high-quality reconstruction and accurate machine analysis are critical for video streaming applications, yet storage and bandwidth resources are often constrained while extensive surveillance demands must be met. To effectively manage system resources such as computing, caching, and communication, a highly efficient video codec becomes essential. To address this, we introduce SHMC, a framework built around flexible, adaptive reconstruction networks for video streaming. On the encoder side, SHMC compresses videos into semantic, structural, and textural layers, with each representation extracted and encoded into compact, scalable bitstreams. The decoder can then selectively reconstruct partial bitstreams for semantic analysis or utilize additional layers for high-quality visual reconstruction, depending on the application. In the case of intrusion detection, the encrypted layer is enabled, and following decryption, the multiframe enhancement module refines the reconstructed video stream. For machine analysis tasks, the selective encoder is employed. Experimental results demonstrate that SHMC consistently surpasses both conventional and learning-based codecs in terms of video reconstruction quality and machine analysis accuracy.
With this design, SHMC achieves efficient and high-quality recovery, marking the first end-to-end deep model capable of large-scale reconstruction and bringing the technology closer to real world use. Beyond surveillance and video conferencing, the proposed framework is applicable to a wide range of video-centric domains with heterogeneous requirements. In telemedicine, selective encryption can protect sensitive patient information while preserving diagnostically relevant visual features for clinical analysis. In autonomous driving, scalable representations enable efficient transmission of perception-critical regions to support object detection and tracking under bandwidth constraints. For live broadcasting and streaming applications, the framework can balance visual quality, latency, and security by adapting scalable layers to network conditions and audience requirements. These examples illustrate the flexibility of the proposed approach and its potential for deployment across diverse real-world video delivery scenarios. Despite the promising results demonstrated in this study, several open issues remain and will be addressed in future work. First, although the proposed framework shows strong performance across diverse datasets, further validation on ultra-high-resolution content (e.g., 4 K and 8 K videos) and under more stringent real-time constraints would provide deeper insights into its scalability. Second, the current implementation focuses on selected machine vision tasks; extending the framework to additional downstream applications such as video segmentation and multi-object tracking represents a valuable direction for future study. SHMC is not tied to a specific backbone; different deep models can be integrated into the framework. Another open direction is enabling fast adaptation to varying compression ratios, which will be explored in future work. 