Trusted Yet Flexible: High-Level Runtimes for Secure ML Inference in TEEs

Steiakakis, Nikolaos-Achilleas; Vasiliadis, Giorgos

doi:10.3390/jcp6010023

Open AccessArticle

Trusted Yet Flexible: High-Level Runtimes for Secure ML Inference in TEEs

by

Nikolaos-Achilleas Steiakakis

¹ and

Giorgos Vasiliadis

^2,3,*

¹

Department of Computer Science, University of Crete, Voutes Campus, 70013 Heraklion, Greece

²

Institute of Computer Science, FORTH (Foundation for Research & Technology–Hellas), 70013 Heraklion, Greece

³

Department of Management Science and Technology, Hellenic Mediterranean University, 72100 Agios Nikolaos, Greece

^*

Author to whom correspondence should be addressed.

J. Cybersecur. Priv. 2026, 6(1), 23; https://doi.org/10.3390/jcp6010023

Submission received: 29 October 2025 / Revised: 30 December 2025 / Accepted: 31 December 2025 / Published: 27 January 2026

(This article belongs to the Section Security Engineering & Applications)

Download

Browse Figure

Versions Notes

Abstract

Machine learning inference is increasingly deployed on shared and cloud infrastructures, where both user inputs and model parameters are highly sensitive. Confidential computing promises to protect these assets using Trusted Execution Environments (TEEs), yet existing TEE-based inference systems remain fundamentally constrained: they rely almost exclusively on low-level, memory-unsafe languages to enforce confinement, sacrificing developer productivity, portability, and access to modern ML ecosystems. At the same time, mainstream high-level runtimes, such as Python, are widely considered incompatible with enclave execution due to their large memory footprints and unsafe model-loading mechanisms that permit arbitrary code execution. To bridge this gap, we present the first Python-based ML inference system that executes entirely inside Intel SGX enclaves while safely supporting untrusted third-party models. Our design enforces standardized, declarative model representations (ONNX), eliminating deserialization-time code execution and confining model behavior through interpreter-mediated execution. The entire inference pipeline (including model loading, execution, and I/O) remains enclave-resident, with cryptographic protection and integrity verification throughout. Our experimental results show that Python incurs modest overheads for small models (≈

17 %

) and outperforms a low-level baseline on larger workloads (97% vs. 265% overhead), demonstrating that enclave-resident high-level runtimes can achieve competitive performances. Overall, our findings indicate that Python-based TEE inference is practical and secure, enabling the deployment of untrusted models with strong confidentiality and integrity guarantees while maintaining developer productivity and ecosystem advantages.

Keywords:

trusted execution environments (TEEs); Intel SGX; confidential computing; machine learning inference; ONNX runtime; Python runtime; Occlum; model confinement; memory safety; secure enclaves

1. Introduction

The rapid adoption of machine learning (ML) across domains such as healthcare, finance, and logistics has introduced new security and privacy challenges [1,2]. While ML models increasingly enable high-impact decision-making, their deployment in real-world settings often requires handling sensitive user data and proprietary model code [3,4]. These dual concerns (i.e., protecting private inputs and safeguarding intellectual property) become especially pressing in cloud-hosted or shared infrastructure environments, where users and model providers alike face a range of confidentiality risks [5,6].

Modern ML inference services typically adopt a service-oriented architecture, allowing users to submit data for inference and receive predictions via APIs. Meanwhile, model developers upload their pre-trained models to centralized platforms, sometimes monetizing their work through usage-based billing; creating ML marketplaces or “app stores” [7,8]. In such deployments, models and user data often meet inside the same runtime, giving rise to complex trust assumptions. A single platform may host multiple models from third-party providers and accept user inputs from clients with varying levels of sensitivity. This setting introduces several key challenges. First, users must trust remote inference services with sensitive personal or corporate data. Second, model providers must upload their proprietary models to infrastructure they do not control. Third, users execute models provided by third parties, without any assurance about the safety of the model’s logic.

Trusted Execution Environments (TEEs) provide a hardware-secured execution domain that enables isolated computation, remote attestation, and secure storage. These mechanisms allow sensitive code and data to be processed in a protected environment that remains isolated from the rest of the software stack, including privileged system components. Platforms such as Intel Software Guard Extensions (SGX) [9] realize these guarantees through enclaves that enforce strong isolation of both computation and state [10,11]. Typically, most TEE-based ML inference systems are implemented using low-level languages such as C or C++ [12,13,14,15]. This choice is driven primarily by the stringent memory constraints of enclaves, where the limited size of the Enclave Page Cache (EPC) makes the substantial memory footprint of high-level runtimes difficult to accommodate. Low-level implementations provide fine-grained control over memory usage and reduce runtime overhead, but they also introduce significant drawbacks, including increased development complexity, manual memory management, and limited access to modern ML ecosystems. In contrast, high-level languages such as Python enable rapid development, rich tooling, and seamless integration with mature ML libraries, but introduce new security challenges in adversarial settings. Common Python serialization mechanisms such as pickle permit arbitrary code execution during deserialization, rendering them fundamentally unsuitable for loading untrusted models inside enclaves and violating the principle that external inputs must remain non-executable. Consequently, enclaves running third-party models in such environments may be exposed to arbitrary computation capable of exfiltrating data, exploiting runtime vulnerabilities, or undermining confinement guarantees [16,17].

In this work, we address these challenges by combining the security guarantees of TEEs with the flexibility of high-level language runtimes and standardized model representations to enable secure and practical ML inference. We adopt the ONNX format, a declarative and verifiable model representation, to enforce a restricted execution model that prevents untrusted models from injecting arbitrary code, thereby ensuring safe execution within the Python runtime. Our implementation runs entirely inside Intel SGX enclaves and leverages a lightweight LibOS to provide only the essential system support for Python-based workloads. All models are stored in ONNX format and undergo integrity verification before loading, and the complete inference pipeline executes exclusively within the enclave. This design protects both model parameters and user inputs from untrusted system components while reducing the attack surface exposed to potential adversaries. Our evaluation demonstrates that, with proper confinement and runtime optimization, Python-based inference achieves a performance competitive with low-level implementations, confirming that secure enclave-resident ML is feasible without sacrificing developer productivity or ecosystem benefits.

The contributions of our work are as follows:

We design and implement the first Python-based ONNX inference runtime that executes entirely inside Intel SGX enclaves, demonstrating that high-level runtimes can provide strong confidentiality and integrity guarantees without sacrificing ecosystem usability or developer productivity.
We enforce ONNX as a declarative and verifiable model format to safely support untrusted third-party models, while preventing arbitrary code execution and code injection inside the enclave.
We conduct a comprehensive experimental evaluation, including a direct comparison with a state-of-the-art low-level Rust SGX backend. Our results show that Python achieves a competitive enclave performance while offering a more flexible and secure deployment path, challenging the prevailing assumption that TEE-based inference must be implemented in low-level languages to be practical.

2. Background

This section reviews Intel SGX and Occlum, a lightweight library OS that facilitates running applications inside enclaves while balancing ease of use and trusted computing base size.

2.1. Intel SGX

Intel^® Software Guard Extensions (SGX) constitute a CPU instruction set extension developed by Intel Corporation that provides a hardware-backed Trusted Execution Environment (TEE) integrated into modern Intel processors [9]. It enables the creation of isolated memory regions, known as enclaves, which protect sensitive code and data from unauthorized access, even when the operating system, hypervisor, or other privileged software is compromised. Isolation is enforced through hardware protections that encrypt enclave memory (the Enclave Page Cache) and restrict access so that only code executing within the enclave can read or modify its contents. Beyond isolation, Intel SGX provides additional primitives central to establishing trust: local and remote attestation allows a relying party to verify that an enclave is genuine, that it is running on legitimate Intel SGX hardware, and that its code has not been tampered with. These mechanisms are fundamental to confidential computing, as they enable secure provisioning of secrets, models, or cryptographic keys only to attested enclave instances. Finally, Intel SGX also supports sealing, allowing enclaves to store data securely across sessions using keys derived from hardware.

Intel SGX is particularly valuable in scenarios involving sensitive computation, such as protecting cryptographic keys, proprietary algorithms, or confidential user data. In machine learning settings like those described in Section 3.1, Intel SGX helps ensure that both model parameters and input data remain protected during inference or training, mitigating risks of leakage or unauthorized modification. However, Intel SGX enclaves come with inherent limitations. The protected memory region is small (typically around 128 MB), and paging beyond this limit incurs substantial performance penalties. Moreover, enclave development requires careful management of enclave transitions, memory passing, and trusted/untrusted interfaces, adding nontrivial complexity.

2.2. Occlum

Occlum [18] is a lightweight library operating system (libOS) designed to run unmodified or lightly modified Linux applications inside Intel SGX enclaves. It provides a POSIX-compatible runtime environment, which simplifies porting existing software to Intel SGX by abstracting many low-level enclave details. By enabling applications to run with minimal modifications, Occlum significantly reduces development effort compared to writing enclave code from scratch or directly using Intel SGX SDK primitives. However, this convenience comes with trade-offs. The inclusion of the Occlum runtime and its dependencies increases the trusted computing base (TCB) size inside the enclave. Despite this increase, the trade-off is often justified by the considerable ease of development, improved compatibility with Linux applications, and better performance optimizations compared to custom enclave implementations.

3. System Model and Constraints

This section describes the operational context of our machine learning inference setting and the constraints that influence the system design. We discuss the roles and trust relationships of data owners, model providers, and the execution platform, highlighting the main security challenges. We also examine the memory footprint of typical inference runtime components to identify the key requirements guiding our design.

3.1. Machine Learning Inference as a Service: Usage Models and Trust Challenges

Machine Learning (ML) inference services have become integral to modern data-driven applications, enabling clients to submit inputs (such as medical images, sensor readings, or financial records) to hosted models that return predictions or classifications. This architecture offers scalability, centralized model management, and seamless integration, making it attractive for diverse domains. In many deployments, the roles of data owners and model providers are decoupled: data owners seek to process their inputs using third-party models, while model providers distribute proprietary models through marketplaces such as Hugging Face Hub [19], AWS Marketplace for ML [20], or Google Cloud AI Hub [21], where access and compensation are typically usage-based. While this flexibility fosters innovation, it also magnifies trust challenges. Data owners must send plaintext inputs to infrastructure they do not control; model providers risk theft or tampering of their intellectual property; and platforms supporting third-party models risk executing artifacts that could lead to data exfiltration. These risks are amplified by inference engines and ML libraries that allow model providers to upload models expressed in rich, expressive languages or formats capable of arbitrary computation, increasing the potential for malicious logic to run inside the enclave or attempt to circumvent confinement. The situation is further complicated in multi-tenant environments, where strong isolation is required between competing data owners and between providers and consumers of models.

Within this landscape, an important design axis concerns the mechanism for loading and reconstructing model artifacts provided by third parties. In mainstream ecosystems, model parameters are frequently distributed in formats that rely on pickle-based deserialization due to convenience and tight tooling integration. However, pickle employs permissive deserialization semantics that can execute attacker-controlled payloads during model loading. In adversarial settings, and especially under the SGX threat model, where enclave control flow must be fully determined by the attested binary and external inputs must remain non-executable, this creates an unacceptable channel for code injection and violates core trust assumptions. A contrasting approach is to require models in a declarative, framework-agnostic format such as ONNX, executed by a minimal and auditable runtime. Treating models as data (rather than executable object graphs) enables structural validation and deterministic initialization prior to enclave admission, reduces the trusted computing base, and narrows the attack surface by eliminating deserialization-time execution. In addition, optimized ONNX runtimes support graph-level and operator-level transformations that can mitigate enclave overheads (e.g., EPC pressure and enclave transitions), improving throughput without expanding the enclave with large, dynamically extensible stacks.

3.2. Threat Model

Our system is framed as a privacy-preserving Machine Learning as a Service (MLaaS) platform designed for the execution of inference workloads on untrusted infrastructure. The operational context involves three primary entities with potentially competing security interests: Data Owners, who require confidentiality for sensitive inputs (e.g., financial records or medical images), Model Providers, who must safeguard their proprietary models as intellectual property, and the Untrusted Host Infrastructure, which encompasses all privileged software, including the operating system and hypervisor, that could attempt unauthorized access or tampering. Furthermore, a critical threat arises from executing third-party models that cannot be fully trusted, as such models may attempt to exfiltrate private data, exploit runtime vulnerabilities, or break isolation boundaries.

The security architecture addresses these threats by employing Intel SGX to protect sensitive code and data from all other software on the platform. The trusted computing base (TCB) resides entirely within the SGX enclave and comprises a software stack that includes the Python runtime environment (derived from the Anaconda distribution), essential numerical libraries such as NumPy, the ONNX Runtime engine, and the Occlum. To ensure model and data protection, the system enforces strict confidentiality and integrity guarantees: all client–server communication, including model uploads and inference requests, is protected using Transport Layer Security (TLS), while the the decryption of incoming data and the encryption of inference outputs occur exclusively within the enclave. Consequently, plaintext data never leaves the trusted memory boundary and remains inaccessible to privileged system components, and the complete inference pipeline executes entirely within the enclave to protect both user inputs and model parameters. In addition, model intellectual property is protected both at rest and during execution, as described in Section 4. Finally, the threat posed by untrusted third-party models is mitigated through confinement mechanisms that rely on a declarative and verifiable representation that treats the model as data, eliminating the risk of arbitrary code execution inherent to insecure serialization methods such as pickle.

3.3. Memory Usage Analysis of Inference Runtime Components

We evaluated the static trusted computing base (TCB) footprint of the Python-based inference stack components within the Occlum LibOS build environment. Measurements represent the total binary image size (including the LibOS, Alpine Linux userland, and the Python interpreter) required to seal the enclave. The sealed binary executable for a minimal Python runtime occupies approximately 120–150 MB; embedding NumPy, essential for numerical operations and .npy tensor I/O, increases this static footprint to 170–200 MB. The inclusion of ONNX Runtime [22], a lightweight inference engine optimized for executing models in the open ONNX format, further raises the total enclave image size to roughly 250–300 MB. In contrast, importing PyTorch pushes the binary baseline beyond 800 MB, making it unsuitable for SGX environments where TCB reduction is critical. Flask and Flask-RESTful, used for REST API exposure, add only a modest overhead to the image. Combining Flask for service orchestration and ONNX Runtime for execution offers the best trade-off between functionality and storage efficiency, providing a secure, lightweight foundation for enclave-based inference services.

The collective memory usage of these components is presented in Figure 1. As we can see, the combination of Flask for the REST API interface and ONNX Runtime for model execution emerges as the most memory-efficient and practical solution for secure ML inference within Intel SGX. This setup leverages ONNX Runtime’s optimized inference capabilities alongside Flask’s simplicity in service orchestration. Consequently, these components form a strong foundation for building secure, memory-efficient, and scalable inference services within the enclave.

4. Design and Implementation

Our system addresses four key challenges in confidential ML inference: (i) protecting user data, (ii) safeguarding proprietary models, (iii) securely executing untrusted models, (iv) providing a usable programming interface. To protect sensitive input, model execution and output encryption all occur within the enclave, ensuring confidentiality and integrity on untrusted infrastructure. Proprietary models are stored encrypted and integrity-protected, and they are only decrypted inside the secure enclave at runtime, preventing tampering and unauthorized access by the host. Untrusted third-party models are restricted to the ONNX format and executed through an interpreter, which limits the execution surface and reduces the risk of arbitrary or malicious behavior. Finally, the use of Python offers a high-level, flexible interface that leverages the rich ML ecosystem while simplifying development and maintaining security guarantees.

4.1. System Overview

We design a privacy-preserving machine learning inference system based entirely on a Python software stack, tailored for execution in untrusted environments. The execution environment is isolated within an Intel SGX enclave and runs the full inference pipeline inside the enclave, providing hardware-enforced confidentiality and integrity for both user data and machine learning model artifacts. The system adopts a client–server model, where a lightweight, enclave-resident service handles inference requests. The use of Python offers several advantages. It allows integration with widely adopted ML tools and workflows, simplifies development, and enables rapid prototyping.

Model confidentiality is enforced by storing and executing models in an encrypted form. Models are loaded directly into the enclave and verified for integrity using cryptographic checksums computed over the model files, ensuring that any tampering or corruption is detected before execution. At no point does the plaintext model leave the enclave, ensuring that proprietary intellectual property remains inaccessible to the underlying platform. To mitigate risks from potentially untrusted model sources, the system restricts execution to interpreted execution (in ONNX format) by the enclave-side runtime. This design avoids executing arbitrary code and confines model behavior within a well-defined execution interface, reducing the attack surface. By leveraging ONNX as an intermediate format, the system remains compatible with models trained in frameworks such as PyTorch and TensorFlow.

4.2. Python Runtime

Our Python runtime is deployed within Occlum, a memory-safe library operating system for Intel SGX enclaves (see Section 2.2). Occlum executes Python as a LibOS process inside the enclave without requiring any enclave-specific code modifications, intercepting system calls made by the Python interpreter. The runtime is built on a minimal Alpine Linux base using the musl C library and busybox utilities, reducing memory footprint and the size of the trusted computing base while maintaining a full POSIX environment. On top of this base, we integrate the Anaconda distribution to provide the full machine learning ecosystem, including a pre-built Python interpreter and widely used libraries such as NumPy, SciPy, Pandas, and ONNX Runtime. All required packages are installed into Occlum’s secure root filesystem, ensuring the Python environment is self-contained and enclave-resident.

Models are stored and accessed using Occlum’s Secure Encrypted File System (SEFS), which provides AES-GCM encryption, integrity verification, and replay protection. Each model is assigned a cryptographic checksum (SHA-256) that is verified before loading and execution to prevent tampering or corruption. Models can reside persistently on disk or be loaded into enclave memory on demand, with optional caching to reduce startup latency for frequently used workloads. During inference, models are executed within an ONNX Runtime session inside the enclave, and can be evicted from memory when no longer needed to reduce memory pressure. All cryptographic keys are generated and managed within the enclave, ensuring that model confidentiality and integrity are maintained even if the underlying storage is untrusted. Models may be pre-loaded at build time or dynamically uploaded at runtime through the REST API (see Section 4.3).

4.3. Secure Client–Server Communication

The enclave hosts a lightweight Flask-based web server that exposes a minimal set of REST API endpoints for model management and inference execution. These include model upload, where providers submit ONNX models for encrypted storage and enclave execution; inference execution, where users submit input data for processing with a specified model; and model metadata queries, which allow clients to retrieve information about available models. All request handling, input validation, and any required preprocessing are performed entirely within the enclave.

To protect models, user inputs, and inference results during transmission, all client–server communication is encrypted and authenticated using Transport Layer Security (TLS). This protection applies uniformly to both model uploads and inference requests. As a result, decryption of incoming data and encryption of inference outputs occur exclusively inside the enclave, ensuring that plaintext data never leaves trusted memory and remains inaccessible to privileged system components such as the operating system or hypervisor.

4.4. ONNX Model Execution

Inference is performed entirely within Intel SGX enclaves using the Python API of the official ONNX Runtime [23]. Models may be preloaded into enclave memory for low-latency execution or stored on disk and loaded on demand. Inputs are provided in the .npy format, the native NumPy [24] binary representation, which preserves array shape, type, and layout. This format is widely used in Python ML workflows and supports arbitrary tensor shapes. Before execution, the server validates the request by confirming the model’s availability, checking tensor rank, shape, and data type, and ensuring all operations occur inside the enclave. Validated data is then passed to ONNX Runtime, and inference results are computed. Outputs are returned to the client over a TLS-encrypted channel, as described in Section 4.3.

5. Evaluation

We evaluate our system along two independent axes: execution target (native CPU vs. Intel SGX) and storage mode (disk vs. memory). Our goals are to (i) quantify the end-to-end inference cost introduced by SGX, and (ii) understand how model placement affects latency. To contextualize these results, we use a Rust implementation of ONNX Runtime as a compiled baseline, against which we compare the performance of our Python-based prototype.

5.1. Experimental Setup

All experiments were performed on a Dell OptiPlex 7050 (Dell Inc.) running Ubuntu 20.04.1 (Linux 5.15.0-76-generic) with an Intel Core i7-7700 @ 4.2 GHz (4C/8T, 8 MB L3), 8 GB RAM, and Intel^® Software Guard Extensions (SGX v1.0).

5.2. Workloads and Scenarios

We evaluate our system using a diverse suite of 13 ONNX models (Table 1), spanning computer vision and natural language processing with sizes from 31 MB to 830 MB, including ResNet-18/50/101/152, MobileNetV2, EfficientNet (Lite4, V2-L), DenseNet, and Inception-V3. This set stresses both small, latency-sensitive networks and large, memory-intensive models.

5.3. Metrics and Methodology

Our primary performance metric is the end-to-end latency (in seconds) per inference request. Unless otherwise noted, latency is measured on the client side, from the moment an HTTP request is issued until the corresponding HTTP response is received. This measurement includes request parsing, input deserialization (.npy), optional model loading and initialization (for disk-based scenarios), full ONNX Runtime execution, result serialization, and authenticated encryption performed inside the enclave. Network transfer overhead is explicitly excluded: the client executes on the same host and communicates over a loopback interface to eliminate network-induced variability. To reduce experimental noise and ensure a near-deterministic execution environment, the server process is pinned to a dedicated CPU core with hyper-threading disabled. ONNX Runtime is configured to use a single inference thread (intra_op=1, inter_op=1) to avoid scheduling artifacts and non-deterministic parallel execution paths. This configuration is applied consistently across both Python and Rust implementations.

We evaluate multiple execution scenarios. For the Disk-based performance, we flush system caches between timed runs using sync; echo 3 > /proc/sys/vm/drop_caches, exposing worst-case disk I/O behavior and EPC paging effects. For the Memory-based performance, models are preloaded into the host memory. Each experiment consists of 25 runs per model and configuration. The first five runs are treated as the warm-up and excluded from analysis, while the remaining 20 runs are used to compute the results. The reported values correspond to the median latency across these runs. We additionally compute the interquartile range (IQR) and verify that all reported medians are stable, with IQRs below 10% of the median. Batch size is fixed to 1 for all experiments, and identical input tensors are reused across runs to ensure comparability.

Overall, the controlled setup yields highly stable measurements, particularly for memory-resident models where preallocated tensors and deterministic execution paths reduce runtime fluctuations. Variance remains negligible (typically below 1%) for EPC-resident models, while disk-based scenarios exhibit slightly higher variability due to Secure Encrypted File System (SEFS) overhead and per-block integrity verification. While we focus on median latency in this study, future iterations will report additional dispersion metrics, including standard deviation and 95% confidence intervals, to further strengthen statistical rigor.

5.4. Memory-Based Performance

Table 2 reports the end-to-end inference latency for all evaluated models when preloaded into memory, comparing execution on a native CPU to execution inside an Intel SGX enclave. With native execution, the inference latency remains consistently low for small and medium-sized networks. Models such as SqueezeNet (0.076 s) and ResNet-50 (0.283 s) exhibit minimal computation times, while even large architectures such as EfficientNet-V2-L remain under 1 s. This baseline indicates that, in the absence of trusted execution overheads, inference is dominated by predictable tensor computation and memory access patterns. When executed inside SGX, inference time increases for all models, and the degree of slowdown correlates strongly with model size. For small networks, the overhead is negligible—for example, SqueezeNet rises only from 0.076 s to 0.089 s—showing that the entire working set fits comfortably inside the Enclave Page Cache (EPC). For larger models, however, the slowdown is more pronounced, such as EfficientNet-V2-L, which increases from 0.994 s to 1.958 s. This behavior is fully consistent with SGX architectural constraints: when the working set approaches EPC capacity, SGX incurs expensive paging events and per-cache-line integrity verification. Each EPC miss triggers hardware-driven encryption, integrity tree updates, and enclave transitions, significantly amplifying memory access latency within deep operator kernels. Thus, SGX overhead scales not with FLOPs, but with the volume and locality of memory accesses.

A notable observation from these results is that, despite the Python runtime occupying roughly 200 MB inside the enclave, its presence does not substantially impact SGX performance. This is because SGX overhead is determined by the dynamic working-set footprint rather than the static enclave size. During inference, the vast majority of Python’s memory remains cold and is not actively accessed. The hot path consists almost exclusively of ONNX operator kernels and their tensor buffers, which form a compact and frequently reused working set. Since SGX only incurs protection overhead on cache lines that are touched, the interpreter’s footprint does not meaningfully contribute to EPC pressure, cache-miss cascades, or enclave exits. In practice, the dominant factor in SGX overhead is large-model tensor access; not the size of the Python runtime.

In summary, memory-resident inference inside SGX introduces modest and predictable slowdowns for small and medium models, and bounded overhead for larger ones whose working sets stress EPC capacity. These results demonstrate that Python-based ONNX inference is practical for in-enclave deployment, provided that large models are chosen or optimized with enclave memory locality in mind.

5.5. Comparison with Low-Level Implementations

To complement the Python analysis, Table 3 reports the memory-resident inference latency for a low-level Rust implementation. The Rust runtime is based on the open-source enclave backend from BlindAI [25]. Although the absolute CPU results are not directly comparable to Python due to fundamental differences in software stacks, the relative SGX overhead is comparable, since both implementations execute the same ONNX models on identical hardware under the same enclave conditions.

The results show that the Rust implementation experiences noticeably higher SGX overhead than the Python runtime. Since both versions rely on the same ONNX operators inside the enclave, the difference is not due to the computation itself, but to how each runtime interacts with memory and the enclave execution model. The Rust implementation integrates more tightly with ONNX Runtime at a low level, which can expose additional SGX-specific costs such as metadata checks and EPC movement on frequently accessed buffers. In contrast, the Python-based stack tends to reuse larger preallocated tensors and follows a more stable execution pattern during inference, which can reduce the number of SGX-protected memory operations in the hot path. These results suggest that, inside SGX, runtime behavior and memory locality can have a greater impact on overhead than native execution speed, and that a low-level implementation does not automatically translate to lower enclave overhead.

5.6. Disk-Based Performance

Table 4 presents the end-to-end inference latency when models are loaded from disk prior to execution on both CPU and SGX. This scenario reflects a cold-start condition in which model weights must be fetched from storage before the ONNX runtime begins inference.

On native execution, even cold loads remain relatively fast, as models are simply read into memory and executed with minimal overhead. In contrast, SGX exhibits significantly higher latency in this setting. For example, ResNet-50 increases from 0.321 s to 8.142 s, while EfficientNet-V2-L rises from 0.905 s to 25.412 s. The dominant source of this slowdown is not the ONNX computation itself, but the cost of loading the encrypted model into the enclave. In the SGX case, model parameters must be read from disk, decrypted, verified, and copied into EPC-protected memory before execution can begin. This preprocessing stage introduces substantial latency that is absent in the memory-resident setup described earlier. This highlights a key difference between the two evaluation scenarios: in the memory-based case, execution starts with the model already resident inside protected memory, whereas in the disk-based case, every cold start pays an expensive loading and decryption cost before inference. As a result, disk-based SGX inference represents a worst-case configuration, and should be avoided in latency-sensitive deployments through model preloading or enclave-side caching.

6. Related Work

Ensuring the privacy and security of machine learning models and data processing has been a focal point in recent research. Various approaches have been explored, each with unique advantages and trade-offs. Below, we discuss the main techniques used for secure machine learning execution.

Homomorphic encryption allows computations to be performed directly on encrypted data without needing to decrypt it first [26]. This ensures data privacy throughout processing, making it suitable for scenarios where data confidentiality is paramount. However, homomorphic encryption comes with significant computational overhead, which can limit its scalability and practicality for large-scale machine learning tasks. Secure Multi-Party Computation (SMPC) is another approach that enables multiple parties to jointly compute a function over their inputs while keeping those inputs private [27,28]. It is highly suitable for collaborative machine learning models where data cannot be shared openly between parties. The main challenge with SMPC lies in its communication and computational complexity, which may impact performance for large-scale models.

Trusted Execution Environments (TEEs), such as Intel SGX, provide a secure enclave within the processor that isolates code and data from the rest of the system [10,29,30]. These environments ensure that the code and data within the enclave remain secure, even if the operating system is compromised. Privado is a notable early system that focused on practical and secure DNN inference using enclaves [12]. Occlumency utilizes SGX to provide privacy guarantees for remote deep-learning tasks [13]. SOTER has been proposed to guard black-box inference for general neural networks at the edge [14]. Temper focuses on providing secure MLaaS capabilities through trusted and efficient model partitioning and enclave reuse strategies [15]. The present work differentiates itself by challenging the convention that TEE-based inference must rely exclusively on low-level languages. Instead, it explores combining TEE isolation with high-level languages, specifically Python and declarative model formats (ONNX), to achieve strong security properties through interpreter-based confinement, while also delivering competitive performance.

Finally, hybrid TEE approaches [31] combine the strengths of multiple security mechanisms, such as TEEs and traditional cryptographic methods. This can offer enhanced security and performance by leveraging the unique properties of each method. For instance, a system may use homomorphic encryption alongside TEEs to ensure both data confidentiality during transmission and secure execution within the enclave. These approaches can be used orthogonally with our system.

7. Conclusions

This work demonstrates that secure machine learning inference inside Intel SGX enclaves can be achieved using high-level languages without resorting to restrictive execution models. We designed and implemented a Python-based enclave runtime that executes ONNX models end-to-end within SGX, providing strong confidentiality and integrity guarantees for both user inputs and model artifacts. In addition, the use of a declarative and verifiable model-loading pipeline enables the secure deployment of untrusted third-party models within TEEs, avoiding unsafe serialization mechanisms such as pickle. Our evaluation shows that Python achieves a performance comparable to a low-level Rust baseline when models are memory-resident. Performance degrades when models must be loaded from disk due to the overhead of Intel SGX decryption and memory movement, highlighting the importance of enclave-side caching and preloading mechanisms. Collectively, these results challenge the assumption that high-level runtimes are unsuitable for secure ML inference: Python-based designs can provide secure model confinement, practical developer productivity, and competitive enclave performance. While SGX remains subject to well-known side-channel risks, our approach emphasizes strong enclave confinement and practical mitigations such as model preloading and locality-aware execution. Future work will explore hybrid execution strategies, model partitioning to reduce EPC pressure, and the automated static validation of ONNX graphs for stronger pre-admission guarantees.

Author Contributions

Conceptualization, N.-A.S. and G.V.; methodology, N.-A.S. and G.V.; software, N.-A.S.; validation, N.-A.S. and G.V.; investigation, N.-A.S. and G.V; writing—original draft preparation, N.-A.S. and G.V; writing—review and editing, N.-A.S. and G.V.; visualization, N.-A.S.; supervision, G.V.; funding acquisition, G.V. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by dAIEdge funded by the European Commission under Grant 101120726.

Data Availability Statement

The original contributions presented in this study are included in the article; further inquiries or requests can be directed to the corresponding authors.

Acknowledgments

We acknowledge the contribution of AI-assisted tools in supporting grammar and syntax checks. The authors have reviewed and edited all generated material and assume full responsibility for the final content of this publication.

Conflicts of Interest

The authors declare no conflicts of interest.

References

Strader, T.J.; Rozycki, J.J.; Root, T.H.; Huang, Y.H.J. Machine learning stock market prediction studies: Review and research directions. J. Int. Technol. Inf. Manag. 2020, 28, 63–83. [Google Scholar] [CrossRef]
Dhanalakshmi, R.; Benjamin, M.; Sivaraman, A.; Sood, K.; Sreedeep, S. Machine Learning-Based Smart Appliances for Everyday Life. In Smart Analytics, Artificial Intelligence and Sustainable Performance Management in a Global Digitalised Economy; Emerald Publishing Limited: Bingley, UK, 2023; Volume 110, pp. 289–301. [Google Scholar]
Liu, B.; Ding, M.; Shaham, S.; Rahayu, W.; Farokhi, F.; Lin, Z. When machine learning meets privacy: A survey and outlook. ACM Comput. Surv. (CSUR) 2021, 54, 1–36. [Google Scholar] [CrossRef]
Yeom, S.; Giacomelli, I.; Fredrikson, M.; Jha, S. Privacy risk in machine learning: Analyzing the connection to overfitting. In Proceedings of the 2018 IEEE 31st Computer Security Foundations Symposium (CSF), Oxford, UK, 9–12 July 2018; pp. 268–282. [Google Scholar]
McGraw, G.; Bonett, R.; Shepardson, V.; Figueroa, H. The top 10 risks of machine learning security. Computer 2020, 53, 57–61. [Google Scholar] [CrossRef]
Tan, S.; Taeihagh, A.; Baxter, K. The risks of machine learning systems. arXiv 2022, arXiv:2204.09852. [Google Scholar] [CrossRef]
Services, A.W. Algorithms and Packages in the AWS Marketplace. Available online: https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-marketplace.html (accessed on 2 August 2025).
Microsoft Ignite. Use a Custom Container to Deploy a Model to an Online Endpoint. 2025. Available online: https://learn.microsoft.com/en-us/azure/machine-learning/how-to-deploy-custom-container (accessed on 1 August 2025).
Costan, V.; Devadas, S. Intel SGX Explained. IACR Cryptology ePrint Archive 2016, Paper 086. Available online: https://eprint.iacr.org/2016/086 (accessed on 30 December 2025).
Schneider, M.; Masti, R.J.; Shinde, S.; Capkun, S.; Perez, R. Sok: Hardware-supported trusted execution environments. arXiv 2022, arXiv:2205.12742. [Google Scholar]
Hunt, T.; Zhu, Z.; Xu, Y.; Peter, S.; Witchel, E. Ryoan: A distributed sandbox for untrusted computation on secret data. ACM Trans. Comput. Syst. (TOCS) 2018, 35, 1–32. [Google Scholar] [CrossRef]
Grover, K.; Tople, S.; Shinde, S.; Bhagwan, R.; Ramjee, R. Privado: Practical and secure DNN inference with enclaves. arXiv 2018, arXiv:1810.00602. [Google Scholar]
Lee, T.; Lin, Z.; Pushp, S.; Li, C.; Liu, Y.; Lee, Y.; Xu, F.; Xu, C.; Zhang, L.; Song, J. Occlumency: Privacy-preserving remote deep-learning inference using SGX. In Proceedings of the 25th Annual International Conference on Mobile Computing and Networking, Los Cabos, Mexico, 21–25 October 2019; pp. 1–17. [Google Scholar]
Shen, T.; Qi, J.; Jiang, J.; Wang, X.; Wen, S.; Chen, X.; Zhao, S.; Wang, S.; Chen, L.; Luo, X.; et al. SOTER: Guarding black-box inference for general neural networks at the edge. In Proceedings of the 2022 USENIX Annual Technical Conference (USENIX ATC 22), Carlsbad, CA, USA, 11–13 July 2022; pp. 723–738. [Google Scholar]
Li, F.; Li, X.; Gao, M. Secure MLaaS with Temper: Trusted and Efficient Model Partitioning and Enclave Reuse. In Proceedings of the 39th Annual Computer Security Applications Conference, Austin, TX, USA, 4–8 December 2023; pp. 621–635. [Google Scholar]
Lee, J.; Jang, J.; Jang, Y.; Kwak, N.; Choi, Y.; Choi, C.; Kim, T.; Peinado, M.; Kang, B.B. Hacking in darkness: Return-oriented programming against secure enclaves. In Proceedings of the 26th USENIX Security Symposium (USENIX Security 17), Vancouver, BC, Canada, 16–18 August 2017; pp. 523–539. [Google Scholar]
Van Bulck, J.; Oswald, D.; Marin, E.; Aldoseri, A.; Garcia, F.D.; Piessens, F. A tale of two worlds: Assessing the vulnerability of enclave shielding runtimes. In Proceedings of the 2019 ACM SIGSAC Conference on Computer and Communications Security, London, UK, 11–15 November 2019; pp. 1741–1758. [Google Scholar]
Shen, Y.; Tian, H.; Chen, Y.; Chen, K.; Wang, R.; Xu, Y.; Xia, Y.; Yan, S. Occlum: Secure and efficient multitasking inside a single enclave of intel sgx. In Proceedings of the Twenty-Fifth International Conference on Architectural Support for Programming Languages and Operating Systems, Lausanne, Switzerland, 16–20 March 2020; pp. 955–970. [Google Scholar]
Hugging Face. Hugging Face Hub. Available online: https://huggingface.co/models (accessed on 1 August 2025).
Amazon Web Services. AWS Marketplace for Machine Learning. Available online: https://aws.amazon.com/marketplace/solutions/machine-learning (accessed on 1 August 2025).
Google Cloud. Google Cloud AI Hub. Available online: https://cloud.google.com/ai-hub (accessed on 1 August 2025).
Zhao, Y.; He, R.; Kersting, N.; Liu, C.; Agrawal, S.; Chetia, C.; Gu, Y. ONNXExplainer: An ONNX Based Generic Framework to Explain Neural Networks Using Shapley Values. arXiv 2023, arXiv:2309.16916. [Google Scholar] [CrossRef]
Kim, S.Y.; Lee, J.; Kim, C.H.; Lee, W.J.; Kim, S.W. Extending the ONNX Runtime Framework for the Processing-in-Memory Execution. In Proceedings of the 2022 International Conference on Electronics, Information, and Communication (ICEIC), Jeju, Republic of Korea, 6–9 February 2022; IEEE: Piscataway, NJ, USA, 2022; pp. 1–4. [Google Scholar]
Oliphant, T.E. Guide to Numpy; Trelgol Publishing: New York City, NY, USA, 2006; Volume 1. [Google Scholar]
Mithril Security. BlindAI. Available online: https://github.com/mithril-security/blindai (accessed on 1 October 2025).
Acar, A.; Aksu, H.; Uluagac, A.S.; Conti, M. A survey on homomorphic encryption schemes: Theory and implementation. ACM Comput. Surv. (Csur) 2018, 51, 1–35. [Google Scholar] [CrossRef]
Zhou, I.; Tofigh, F.; Piccardi, M.; Abolhasan, M.; Franklin, D.; Lipman, J. Secure Multi-Party Computation for Machine Learning: A Survey. IEEE Access 2024, 12, 53881–53899. [Google Scholar] [CrossRef]
Knott, B.; Venkataraman, S.; Hannun, A.; Sengupta, S.; Ibrahim, M.; van der Maaten, L. Crypten: Secure multi-party computation meets machine learning. Adv. Neural Inf. Process. Syst. 2021, 34, 4961–4973. [Google Scholar]
Zheng, W.; Wu, Y.; Wu, X.; Feng, C.; Sui, Y.; Luo, X.; Zhou, Y. A survey of Intel SGX and its applications. Front. Comput. Sci. 2021, 15, 153808. [Google Scholar] [CrossRef]
Liu, C.; Guo, H.; Xu, M.; Wang, S.; Yu, D.; Yu, J.; Cheng, X. Extending on-chain trust to off-chain–trustworthy blockchain data collection using trusted execution environment (tee). IEEE Trans. Comput. 2022, 71, 3268–3280. [Google Scholar] [CrossRef]
Natarajan, D.; Loveless, A.; Dai, W.; Dreslinski, R. CHEX-MIX: Combining Homomorphic Encryption with Trusted Execution Environments for Oblivious Inference in the Cloud. In Proceedings of the 2023 IEEE 8th European Symposium on Security and Privacy (EuroS&P), Delft, The Netherlands, 3–7 July 2023; IEEE: Piscataway, NJ, USA, 2023; pp. 73–91. [Google Scholar]

Figure 1. Memory usage of different runtime components and libraries in the Python ML environment. The stacked bars show the baseline footprint of Python together with ONNX Runtime, PyTorch, NumPy, and Flask. This comparison highlights their relative overheads, providing insight into their suitability for secure, memory-constrained inference deployments.

Table 1. ONNX models used in evaluation and their sizes.

Model	Size (MB)
SqueezeNet-1.0	31.68
ResNet-18	66.50
EfficientNet-Lite4	179.77
Inception-V3	188.41
MobileNetV2	214.76
DenseNet	337.03
ResNet-50	376.93
ResNet-101	592.75
ResNet-152	829.46
EfficientNet-V2-L	875.63

Table 2. End-to-end inference latency (seconds) for memory-resident ONNX models using our Python-based system proposed in this paper. The table compares native CPU execution with SGX execution, and the Overhead column reflects the relative slowdown introduced by SGX.

Model	CPU (s)	SGX (s)	Overhead (%)
SqueezeNet-1.0	0.076	0.089	17.1%
ResNet-18	0.157	0.336	113.8%
EfficientNet-Lite4	0.180	0.393	118.3%
Inception-V3	0.225	0.587	160.9%
MobileNetV2	0.092	0.092	0.0%
DenseNet	0.230	0.370	60.9%
ResNet-50	0.283	0.551	94.6%
ResNet-101	0.506	0.854	68.9%
ResNet-152	0.655	1.098	67.7%
EfficientNet-V2-L	0.994	1.958	96.9%

Table 3. End-to-end inference latency (seconds) for memory-resident ONNX models using a low-level Rust implementation (based on BlindAI [25]). The Overhead column reflects the relative slowdown introduced by SGX.

Model	CPU (s)	SGX (s)	Overhead (%)
SqueezeNet-1.0	0.121	0.096	$- 20.7$ %
ResNet-18	0.295	0.647	119.3%
EfficientNet-Lite4	0.577	1.463	153.6%
Inception-V3	0.697	1.656	137.6%
MobileNetV2	0.229	0.400	74.7%
DenseNet	0.452	1.123	148.5%
ResNet-50	0.665	1.328	99.7%
ResNet-101	0.913	1.967	115.4%
ResNet-152	1.144	2.578	125.3%
EfficientNet-V2-L	1.316	4.802	264.9%

Table 4. End-to-end inference latency (seconds) for ONNX models loaded from disk before execution using our Python-based system proposed in this paper. Native CPU execution is compared with SGX execution, and the Overhead column reflects the relative slowdown introduced by SGX.

Model	CPU (s)	SGX (s)	Overhead (%)
SqueezeNet-1.0	0.073	0.192	163.0%
ResNet-18	0.150	3.239	2059.3%
EfficientNet-Lite4	0.177	2.234	1162.1%
Inception-V3	0.279	5.477	1863.4%
MobileNetV2	0.084	1.145	1263.1%
DenseNet	0.215	2.018	838.1%
ResNet-50	0.321	8.142	2436.4%
ResNet-101	0.447	13.554	2929.5%
ResNet-152	0.637	19.995	3040.0%
EfficientNet-V2-L	0.905	25.412	2709.8%

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Steiakakis, N.-A.; Vasiliadis, G. Trusted Yet Flexible: High-Level Runtimes for Secure ML Inference in TEEs. J. Cybersecur. Priv. 2026, 6, 23. https://doi.org/10.3390/jcp6010023

AMA Style

Steiakakis N-A, Vasiliadis G. Trusted Yet Flexible: High-Level Runtimes for Secure ML Inference in TEEs. Journal of Cybersecurity and Privacy. 2026; 6(1):23. https://doi.org/10.3390/jcp6010023

Chicago/Turabian Style

Steiakakis, Nikolaos-Achilleas, and Giorgos Vasiliadis. 2026. "Trusted Yet Flexible: High-Level Runtimes for Secure ML Inference in TEEs" Journal of Cybersecurity and Privacy 6, no. 1: 23. https://doi.org/10.3390/jcp6010023

APA Style

Steiakakis, N.-A., & Vasiliadis, G. (2026). Trusted Yet Flexible: High-Level Runtimes for Secure ML Inference in TEEs. Journal of Cybersecurity and Privacy, 6(1), 23. https://doi.org/10.3390/jcp6010023

Article Menu

Trusted Yet Flexible: High-Level Runtimes for Secure ML Inference in TEEs

Abstract

1. Introduction

2. Background

2.1. Intel SGX

2.2. Occlum

3. System Model and Constraints

3.1. Machine Learning Inference as a Service: Usage Models and Trust Challenges

3.2. Threat Model

3.3. Memory Usage Analysis of Inference Runtime Components

4. Design and Implementation

4.1. System Overview

4.2. Python Runtime

4.3. Secure Client–Server Communication

4.4. ONNX Model Execution

5. Evaluation

5.1. Experimental Setup

5.2. Workloads and Scenarios

5.3. Metrics and Methodology

5.4. Memory-Based Performance

5.5. Comparison with Low-Level Implementations

5.6. Disk-Based Performance

6. Related Work

7. Conclusions

Author Contributions

Funding

Data Availability Statement

Acknowledgments

Conflicts of Interest

References

Share and Cite

Article Metrics

Article Access Statistics

Further Information

Guidelines

MDPI Initiatives

Follow MDPI