1. Introduction
Volumetric video [1] has emerged as a transformative technology for creating immersive, interactive experiences in applications such as virtual reality, telemedicine, and remote education [2]. Unlike traditional 2D video, which represents flat images, volumetric video captures 3D data, often in the form of point clouds or textured meshes, allowing users six degrees of freedom (6-DoF) to explore scenes from different perspectives. This capability is particularly promising for telemedicine, enabling applications like remote surgical consultations, collaborative diagnostics, and immersive medical training. For instance, in remote surgery scenarios, volumetric video can transmit the surgeon’s precise hand movements and the patient’s real-time 3D anatomy. However, this rich data stream simultaneously poses a significant risk, as it could expose critical procedural details or patient biometrics if intercepted.
Despite its immense potential, the widespread adoption of volumetric video in telemedicine faces critical technical hurdles: substantial data size and stringent real-time processing demands. These factors lead to significant challenges in bandwidth consumption and latency tolerance. Efficient and responsive streaming is crucial, especially for time-sensitive medical applications where low latency (e.g., sub-50 ms Motion-to-Photon latency) is essential for seamless and safe interactions.
More critically, the detailed nature of volumetric data introduces severe privacy risks that directly conflict with stringent healthcare regulations like HIPAA and GDPR. The transmitted data streams can expose a wealth of sensitive information. For patients, volumetric data often contains detailed biometric identifiers, such as facial contours and gait patterns, which could be exploited for unauthorized user identification [3]. Moreover, the transmission of raw or processed 3D medical data, such as DICOM files, risks leaking sensitive content, including patient identifiers or specific anatomical details [4]. From the clinician’s perspective, head and gaze tracking data can inadvertently reveal psychological states, decision-making processes, or procedural expertise. Studies have shown that such tracking data can be linked to medical conditions such as autism [5] and PTSD [6], and can even be used in diagnosing dementia [7,8]. In a medical context, this could expose professional vulnerabilities. These multifaceted privacy risks highlight the urgent need for a robust, specialized privacy-preserving solution that can safeguard data for both patients and clinicians without compromising the real-time performance essential for advanced telemedicine applications.
To address these intertwined challenges of privacy, efficiency, and real-time performance, we propose SecureTeleMed, a dual-track encryption scheme specifically tailored for volumetric video-based telemedicine. SecureTeleMed employs two key privacy-preserving mechanisms: viewport obfuscation and frame-wise encryption with region of interest (ROI) mapping. Viewport obfuscation selectively encrypts high-priority segments in the viewport data stream based on a calculated Prediction Contribution Value (PCV), which identifies segments that are both critical for predicting future user behavior and highly sensitive from a privacy perspective. This targeted approach protects sensitive data, such as clinician head movements or patient biometrics, while maintaining the ultra-low latency required for real-time medical interactions. Additionally, SecureTeleMed uses frame-wise encryption with ROI mapping, where user interest is projected from the 3D space onto the 2D frame, assigning an interest score to each tile. High-interest areas, such as anatomical regions in 3D medical scans, receive stronger encryption, ensuring robust content privacy without overburdening the system’s computational resources.
By integrating these complementary techniques, SecureTeleMed offers a comprehensive solution for securing both user behavioral data (viewport) and rendered content (frames) in real-time telemedicine services. This adaptive approach not only meets stringent privacy requirements but also supports efficient streaming performance under tight latency constraints, making it particularly suitable for high-stakes medical applications. The contributions of this work include the following:
Dual-Track Encryption Scheme: A novel scheme that protects both viewport trajectory data and rendered 2D frames, comprehensively addressing the distinct privacy challenges inherent in volumetric video streaming for telemedicine.
ROI-Guided Selective Encryption: A dynamic mechanism that assigns encryption levels based on real-time user focus (ROI), effectively balancing robust privacy protection with the computational efficiency demanded by real-time medical interactions.
Adaptive Frame-Wise Encryption: An encryption strategy where strength is varied according to each tile’s calculated interest score, optimizing data protection while maintaining the real-time performance necessary for medical data streams.
The remainder of this paper is structured as follows:
Section 2 reviews related work in volumetric video, telemedicine, and privacy preservation techniques.
Section 3 details the methodology of SecureTeleMed, including viewport obfuscation and frame-wise encryption mechanisms.
Section 4 presents experimental results evaluating encryption efficiency and privacy-preserving performance.
Section 5 discusses limitations and future work, and
Section 6 concludes the paper.
3. Method
This section details the technical framework of SecureTeleMed, a dual-track encryption scheme designed to address the unique privacy and real-time challenges of volumetric video in telemedicine. It first outlines the overall workflow integrating viewport obfuscation and frame-wise encryption with ROI mapping, then elaborates on each core mechanism: viewport obfuscation (which uses Prediction Contribution Value to selectively encrypt sensitive segments) and frame-wise encryption (which dynamically assigns encryption levels based on user perception sensitivity scores). These techniques are designed to balance robust privacy protection with the ultra-low latency required for real-time medical interactions.
3.1. Overview
Building on existing privacy-preserving solutions for video streaming, we propose SecureTeleMed, a system designed to secure both user behavior and video transmission in real-time telemedicine services. SecureTeleMed addresses the unique privacy challenges of telemedicine applications, such as remote surgery and collaborative diagnostics, by safeguarding sensitive information throughout the streaming process. As illustrated in Figure 1, SecureTeleMed integrates two key mechanisms: viewport obfuscation and frame-wise encryption with region of interest (ROI) mapping. These mechanisms ensure robust privacy protection while maintaining the low latency required for real-time medical interactions.
Viewport Obfuscation: In SecureTeleMed, we employ a cloud rendering architecture to transform the user’s 3D scene perception into 2D frames based on the uploaded viewport trajectory. To ensure a smooth viewing experience, the Motion-to-Photon (MTP) latency must be kept below 50 ms [16], leaving limited room for implementing protection schemes. To address this, we introduce viewport obfuscation, which selectively encrypts high-priority segments of the viewport data stream. This approach protects sensitive user data, such as clinician head and gaze movements, without compromising real-time performance.
Frame-wise Encryption: We devise frame-wise encryption to protect sensitive regions of rendered 2D frames. Using a User Region of Interest Occupancy Encryption approach, our SecureTeleMed scheme dynamically identifies and encrypts high-interest areas based on user attention scores. This avoids over-encryption, reduces computational overhead, and ensures low latency, which is critical for balancing real-time performance and data security in telemedicine.
3.2. Viewport Obfuscation
Volumetric video streaming in telemedicine demands ultra-low latency (under 50 ms) for tasks like remote surgery, which limits the encryption overhead that can be tolerated while maintaining real-time performance. Under these constraints, it is crucial to prioritize encrypting the data most critical to prediction accuracy and user privacy. In practice, certain segments carry richer information for predicting the user’s future viewport and for ensuring smooth playback. We therefore propose the Prediction Contribution Value (PCV), a metric that quantifies each segment’s importance by combining data variation frequency and spread variance. Specifically, the $\mathrm{PCV}$ of segment $i$ is defined as
$$\mathrm{PCV}_i = \alpha f_i + \beta \sigma_i^2,$$
where $\alpha$ and $\beta$ are weights with $\alpha + \beta = 1$, $f_i$ represents the frequency of significant changes in segment $i$, $\sigma_i^2$ is the variance of the segment, and $\mu_i$ is the mean of segment $i$, about which the variance is computed.
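As an illustration of how a PCV could be computed for one segment of one viewport dimension, the following is a minimal Python sketch. It assumes that a "significant change" is a sample-to-sample difference larger than a hypothetical threshold `change_eps`, that frequency is the fraction of such steps, and that the two terms are combined as a weighted sum with equal weights. The function names and default values are illustrative assumptions, not part of SecureTeleMed.

```python
from statistics import fmean, pvariance


def change_frequency(samples, change_eps):
    """Fraction of consecutive steps whose absolute change exceeds change_eps."""
    if len(samples) < 2:
        return 0.0
    jumps = sum(1 for a, b in zip(samples, samples[1:]) if abs(b - a) > change_eps)
    return jumps / (len(samples) - 1)


def pcv(samples, alpha=0.5, beta=0.5, change_eps=1.0):
    """Weighted sum of change frequency and variance (assumed form, alpha + beta = 1)."""
    mu = fmean(samples)            # segment mean
    var = pvariance(samples, mu)   # population variance about that mean
    return alpha * change_frequency(samples, change_eps) + beta * var


# Hypothetical yaw-angle segments (degrees): one steady, one with rapid head turns.
steady = [10.0, 10.1, 10.0, 10.1, 10.0]
rapid = [10.0, 14.0, 9.0, 15.0, 8.0]
print(pcv(steady), pcv(rapid))
```

In this sketch the rapid segment scores higher on both terms, so it would be the one flagged for stronger protection. In practice the two terms have different scales, so some normalization per dimension would likely be needed before weighting.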
To further address the need for selective encryption while adhering to stringent latency requirements, we introduce viewport obfuscation, a sophisticated mechanism that intelligently segments the user’s viewport data stream and dynamically applies stronger encryption only to segments identified as having a higher Prediction Contribution Value (PCV). This targeted approach achieves an optimal balance between robust privacy protection and computational efficiency, ensuring that sensitive data—such as detailed clinician head movements, gaze patterns, or patient biometric features—is adequately secured without exceeding the critical latency constraints necessary for real-time telemedicine interactions (e.g., sub-50 ms Motion-to-Photon latency for remote surgery). Instead of employing a brute-force method that encrypts the entire viewport stream, which would introduce significant computational overhead and jeopardize real-time performance, we propose a dimension-based segmentation strategy. This strategy involves independently analyzing each data dimension (e.g., X-axis position, yaw angle, pitch angle) to precisely identify and selectively encrypt “hot segments”—specific temporal or spatial regions within the viewport trajectory that are simultaneously critical for accurate viewport prediction and highly sensitive from a privacy perspective. By focusing encryption efforts on these high-PCV “hot segments,” our approach minimizes overall encryption latency and computational load, while maximizing the protection of the most vulnerable and informative data elements within the user’s viewport stream. In practice, we develop the following key techniques to form our comprehensive viewport obfuscation mechanism.
Dimension-Based Segmentation: Each data dimension (e.g., X-axis position, yaw angle, pitch angle) is segmented and analyzed independently, allowing fine-grained identification of regions critical for accurate predictions and privacy, as different dimensions vary in frequency and importance. This per-dimension analysis ensures that encryption decisions are tailored to the unique behavioral patterns in each axis of movement, avoiding over-encryption of stable or redundant signals.
High-Priority Segment Detection: High-frequency and high-variance segments are prioritized, as frequency tracks dynamic changes—such as rapid head turns or shifts in gaze direction—and variance captures wide behavioral ranges that may reveal sensitive contextual information (e.g., reaction to patient condition). These high-PCV segments are flagged as “hot segments” and marked for enhanced protection, since they simultaneously contribute most to viewport prediction accuracy and pose the greatest privacy risk if exposed.
Selective Encryption for Privacy and Prediction Efficiency: We employ a tiered encryption strategy based on the Prediction Contribution Value ($\mathrm{PCV}$). Let $\theta_m$ (medium) and $\theta_h$ (high) be two thresholds with $\theta_m < \theta_h$. Segments are classified and encrypted as follows:
- High Encryption: $\mathrm{PCV} \geq \theta_h$ → AES-256, 256-bit key, 14 rounds. Protects highly sensitive, predictive segments (e.g., rapid gaze shifts).
- Medium Encryption: $\theta_m \leq \mathrm{PCV} < \theta_h$ → AES-192, 192-bit key, 12 rounds. Balances security and efficiency for moderately dynamic data.
- Low Encryption: $\mathrm{PCV} < \theta_m$ → AES-128, 128-bit key, 10 rounds. Minimizes overhead for stable, low-risk segments.
This adaptive scheme focuses encryption on high-PCV “hot segments,” ensuring strong privacy for critical data while keeping end-to-end latency below 10 ms—meeting real-time demands of telemedicine applications like remote surgery.
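The tiered selection described above can be sketched as a simple mapping from per-segment PCV values to AES key sizes, applied independently to each viewport dimension. This is a minimal illustration only: the threshold values, the inclusive lower bounds, and the names `aes_key_bits` and `plan_encryption` are assumptions made for the example.

```python
def aes_key_bits(pcv_value, medium_threshold, high_threshold):
    """Map one PCV value to an AES key size (128, 192, or 256 bits)."""
    if medium_threshold >= high_threshold:
        raise ValueError("medium_threshold must be below high_threshold")
    if pcv_value >= high_threshold:
        return 256  # high tier: AES-256, 14 rounds
    if pcv_value >= medium_threshold:
        return 192  # medium tier: AES-192, 12 rounds
    return 128      # low tier: AES-128, 10 rounds


def plan_encryption(pcv_by_dimension, medium_threshold, high_threshold):
    """Choose a key size for every segment of every dimension independently."""
    return {
        dim: [aes_key_bits(v, medium_threshold, high_threshold) for v in values]
        for dim, values in pcv_by_dimension.items()
    }


# Hypothetical PCV values for three consecutive segments of two dimensions.
plan = plan_encryption(
    {"yaw": [0.1, 0.5, 0.9], "x": [0.2, 0.2, 0.35]},
    medium_threshold=0.3,
    high_threshold=0.7,
)
print(plan)
```

Keeping the decision per dimension means a burst of head rotation raises the tier for the yaw stream only, while a stable position stream stays at the lightest tier.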
3.3. Frame-Wise Encryption
Given the high bandwidth demands and privacy concerns in volumetric video streaming, selectively encrypting parts of each frame balances privacy protection with computational efficiency. Complementing viewport obfuscation, SecureTeleMed integrates frame-wise encryption, which secures only the sensitive parts of each rendered 2D frame during the transmission process. This approach enhances privacy by encrypting areas deemed sensitive based on the user’s viewport projection and relevant content changes, ensuring efficient protection without unnecessary processing.
User Perception Sensitivity Assessment: Building on prior work [17] on user perception preferences in volumetric video, we extend the ROI formula of [17] so that it works with the tiles projected from the 3D scene onto the rendered 2D frame, which lets us evaluate the sensitivity distribution in 3D content perception. The User Perception Sensitivity score $S_t$ of each 2D tile $t$ is calculated from three quantities: $\rho_t$ represents the projected density of points from the 3D cube onto the 2D tile $t$, similar to the point density in the 3D cube; $d_t$ indicates the distance from the user’s headset to the center of the 3D cube corresponding to the tile $t$; and $f_t$ denotes the frequency of the tile falling within the user’s gaze frustum. To assess the user’s sensitivity distribution over each rendered frame, we estimate the frequency of interest for each tile $t$ as
$$f_t = \frac{1}{N} \sum_{i=1}^{N} c_{t,i},$$
where $c_{t,i}$ is the count of times the user’s gaze aligns with tile $t$ during sample $i$, and $N$ is the total number of samples of user behavior.
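The frequency-of-interest estimate can be illustrated with a short Python sketch. It assumes that each behavior sample is recorded as a list of the tile indices the gaze aligned with during that sample (a tile may appear more than once), and that the per-tile frequency is the total count divided by the number of samples. The input layout and the name `gaze_frequency` are hypothetical.

```python
from collections import Counter


def gaze_frequency(samples, num_tiles):
    """Per-tile gaze count summed over all samples, divided by the sample count."""
    n = len(samples)
    if n == 0:
        return [0.0] * num_tiles
    counts = Counter(tile for hits in samples for tile in hits)
    return [counts[t] / n for t in range(num_tiles)]


# Four hypothetical samples over a frame with four tiles (indices 0..3).
samples = [[0, 1], [1], [1, 1], [3]]
freq = gaze_frequency(samples, num_tiles=4)
print(freq)  # [0.25, 1.0, 0.0, 0.25]
```

Tile 1 is hit four times across four samples, so it receives the highest frequency and would contribute most to the sensitivity score of that tile.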
Viewport Projection with Region of Interest (ROI) Mapping: We use gaze and head orientation to map 3D areas of interest onto the 2D frame. Based on Formula (4), each user interest score defines a 2D ROI, prioritizing tiles with higher scores for encryption to protect sensitive regions. After assigning interest scores to each tile in the 2D frame based on its 3D relevance, SecureTeleMed applies encryption levels tailored to each area’s sensitivity. As shown in Figure 2, each frame is divided into 8 × 8 tiles, with interest scores assigned for sensitivity assessment. High-sensitivity tiles are then selected for encryption, realizing selective tile encryption.
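To make the tile mapping concrete, the sketch below shows one possible way to locate a projected gaze point in the 8 × 8 grid and to pick the tiles whose score passes a threshold. It assumes the gaze point has already been projected to normalized frame coordinates in [0, 1]; the 3D-to-2D projection itself is not shown, and the names `tile_of` and `select_tiles` as well as the example scores are hypothetical.

```python
GRID = 8  # the frame is divided into 8 x 8 tiles


def tile_of(u, v, grid=GRID):
    """Map normalized frame coordinates (0 <= u, v <= 1) to a (row, col) tile."""
    if not (0.0 <= u <= 1.0 and 0.0 <= v <= 1.0):
        raise ValueError("coordinates must be normalized to [0, 1]")
    col = min(int(u * grid), grid - 1)  # clamp u == 1.0 into the last column
    row = min(int(v * grid), grid - 1)  # clamp v == 1.0 into the last row
    return row, col


def select_tiles(scores, threshold):
    """Return (row, col) of every tile whose score is at least threshold."""
    return [
        (r, c)
        for r, row in enumerate(scores)
        for c, s in enumerate(row)
        if s >= threshold
    ]


# Hypothetical interest scores: two tiles of interest on an otherwise flat frame.
scores = [[0.0] * GRID for _ in range(GRID)]
r, c = tile_of(0.5, 0.5)
scores[r][c] = 0.9
scores[0][7] = 0.4
print(select_tiles(scores, threshold=0.5))  # [(4, 4)]
```

Only the tile under the gaze point is selected here; lowering the threshold to 0.4 would also select the tile at row 0, column 7.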
To achieve a balance between encryption speed and security, a hybrid encryption scheme employing AES [18] and RSA [19] algorithms is utilized, as demonstrated in Figure 3. AES, an efficient symmetric encryption algorithm, encrypts data using a shared key K, with the sender encrypting the message and the receiver decrypting it using the same key. Due to its high-speed encryption capabilities, AES is particularly advantageous for real-time video transmission. Our implementation utilizes AES in Galois/Counter Mode (GCM), which provides both confidentiality and integrity protection through authenticated encryption. GCM is particularly suitable for real-time volumetric video streaming where data authenticity is as critical as privacy, as it detects any tampering with the ciphertext. The ability of GCM to perform encryption and authentication in parallel further enhances its efficiency for our low-latency telemedicine applications, ensuring that the cryptographic overhead remains minimal while providing robust security guarantees. To address the security concerns associated with direct key transmission, RSA, an asymmetric encryption algorithm, is incorporated. The sender encrypts the AES key K using the receiver’s public RSA key, generating a ciphertext. The receiver then decrypts this ciphertext using their private RSA key, retrieving the AES key K. Subsequently, the receiver decrypts the AES-encrypted message using K, thus obtaining the original message. This hybrid approach effectively integrates the high-speed data encryption of AES with the secure key exchange of RSA, resulting in efficient and secure communication.
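The hybrid flow described above can be sketched in Python with the third-party `cryptography` package. This is an illustrative sketch under stated assumptions, not the SecureTeleMed implementation: the choice of RSA-2048 with OAEP/SHA-256 for key wrapping, the 12-byte random nonce, the use of a tile identifier as associated data, and the function names are all assumptions made for the example.

```python
import os

from cryptography.hazmat.primitives import hashes
from cryptography.hazmat.primitives.asymmetric import padding, rsa
from cryptography.hazmat.primitives.ciphers.aead import AESGCM

# Assumed key-wrapping padding: RSA-OAEP with SHA-256.
OAEP = padding.OAEP(
    mgf=padding.MGF1(algorithm=hashes.SHA256()),
    algorithm=hashes.SHA256(),
    label=None,
)


def encrypt_tile(tile_bytes, tile_id, receiver_public_key, key_bits=256):
    """Encrypt a tile with AES-GCM and wrap the AES key K with RSA."""
    key = AESGCM.generate_key(bit_length=key_bits)  # 128, 192, or 256
    nonce = os.urandom(12)                          # must be unique per key
    ciphertext = AESGCM(key).encrypt(nonce, tile_bytes, tile_id)
    wrapped_key = receiver_public_key.encrypt(key, OAEP)
    return wrapped_key, nonce, ciphertext


def decrypt_tile(wrapped_key, nonce, ciphertext, tile_id, receiver_private_key):
    """Unwrap K with the RSA private key, then verify and decrypt the tile."""
    key = receiver_private_key.decrypt(wrapped_key, OAEP)
    return AESGCM(key).decrypt(nonce, ciphertext, tile_id)  # raises InvalidTag if tampered


# Round trip with a freshly generated receiver key pair and dummy tile data.
private_key = rsa.generate_private_key(public_exponent=65537, key_size=2048)
tile = b"\x00\x01\x02" * 100
wrapped_key, nonce, ciphertext = encrypt_tile(tile, b"tile-0-3", private_key.public_key())
recovered = decrypt_tile(wrapped_key, nonce, ciphertext, b"tile-0-3", private_key)
print(recovered == tile)
```

For clarity the sketch wraps a fresh key for every tile. A real-time system would more plausibly wrap one session key per tier once and reuse it with distinct nonces, since the RSA operation is far slower than the AES-GCM step.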
Dynamic Encryption Allocation: We propose the Dynamic Encryption Allocation algorithm to optimize privacy and efficiency by assigning encryption levels based on each tile’s user perception sensitivity score $S_t$. In particular, this approach ensures tiles with higher sensitivities receive stronger encryption, while less critical areas use lighter encryption, balancing privacy and real-time performance. The algorithm follows these specific steps:
Sensitivity-Based Encryption Assignment: Each tile in the frame is assigned a user perception sensitivity score $S_t$, reflecting its importance based on user perception. The algorithm then categorizes these tiles into three encryption levels by comparing each score with predefined thresholds $\tau_m$ and $\tau_h$, where $\tau_m < \tau_h$.
High Encryption is applied to tiles with $S_t \geq \tau_h$, using a key length of 256 bits and 14 encryption rounds, ensuring robust security for the most sensitive data.
Medium Encryption is applied to tiles where $\tau_m \leq S_t < \tau_h$, using a key length of 192 bits and 12 rounds, providing a balance of security and efficiency.
Low Encryption is assigned to tiles with $S_t < \tau_m$, using a key length of 128 bits and 10 rounds, minimizing computations for less critical data.
Resource Constraints Adjustment: The algorithm evaluates the encryption plan’s cost against system constraints. If the constraints are exceeded, thresholds $\tau_m$ and $\tau_h$ are adjusted to reduce the encryption load, ensuring resource limits are met while prioritizing sensitive areas.
Efficiency and Adaptability: The selective encryption algorithm ensures real-time performance by focusing stronger encryption on high-interest areas, reducing computational load while maintaining privacy and efficiency.
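The allocation steps above can be sketched as a small loop that assigns a tier to every tile and raises both thresholds until the plan fits a budget. Several details are assumptions made for illustration: the per-tile cost is taken to be the AES round count, the sensitivity scores are scaled to integers from 0 to 100 to keep the arithmetic exact, and both thresholds are raised by a fixed `step`. The source does not specify the cost model or the adjustment rule.

```python
# Assumed cost model: cost of a tile is its AES round count.
ROUNDS = {128: 10, 192: 12, 256: 14}


def assign_levels(scores, t_med, t_high):
    """Key size per tile: 256 if score >= t_high, 192 if >= t_med, else 128."""
    return [256 if s >= t_high else 192 if s >= t_med else 128 for s in scores]


def plan_cost(levels):
    """Total cost of an encryption plan under the assumed cost model."""
    return sum(ROUNDS[bits] for bits in levels)


def allocate(scores, t_med, t_high, budget, step=5, max_iters=100):
    """Raise both thresholds until the plan fits the budget or iterations run out."""
    levels = assign_levels(scores, t_med, t_high)
    for _ in range(max_iters):
        if plan_cost(levels) <= budget:
            break
        t_med += step
        t_high += step
        levels = assign_levels(scores, t_med, t_high)
    return levels, t_med, t_high


# Five hypothetical tiles; the initial plan costs 62 and the budget is 58.
levels, t_med, t_high = allocate([90, 80, 60, 40, 10], t_med=30, t_high=70, budget=58)
print(levels, t_med, t_high)  # [256, 192, 192, 128, 128] 45 85
```

Because the thresholds only move upward, the most sensitive tile keeps the strongest tier while lower-scoring tiles are demoted first. If the budget is below the cost of encrypting every tile at 128 bits, the loop ends with the lightest plan rather than meeting the budget.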