DACCA: Distributed Adaptive Cloud Continuum Architecture

Deligiannakis, Nektarios; Papataxiarhis, Vassilis; Loukeris, Michalis; Hadjiefthymiades, Stathes; Touloupou, Marios; Ul Hassan, Syed Mafooq; Herodotou, Herodotos; Moustakas, Thanasis; Bampis, Emmanouil; Ioannidis, Konstantinos; Michailidis, Iakovos T.; Vrochidis, Stefanos; Kosmatopoulos, Elias; Romero Martínez, Francisco Javier; Marín Pérez, Rafael; Mousa, Amr; Castellini, Jacopo; Strasser, Pablo

doi:10.3390/fi18020074

by Nektarios Deligiannakis ¹,*, Vassilis Papataxiarhis ¹, Michalis Loukeris ¹, Stathes Hadjiefthymiades ¹, Marios Touloupou ², Syed Mafooq Ul Hassan ², Herodotos Herodotou ², Thanasis Moustakas ³,*, Emmanouil Bampis ³, Konstantinos Ioannidis ³, Iakovos T. Michailidis ³, Stefanos Vrochidis ³, Elias Kosmatopoulos ³, Francisco Javier Romero Martínez ⁴, Rafael Marín Pérez ⁴, Amr Mousa ⁵, Jacopo Castellini ⁶ and Pablo Strasser ⁶

¹ Department of Informatics and Telecommunications, National and Kapodistrian University of Athens, 15784 Athens, Greece
² Department of Electrical Engineering and Computer Science and Engineering, Cyprus University of Technology, 3036 Limassol, Cyprus
³ Information Technologies Institute (ITI), Centre for Research and Technology Hellas (CERTH), 57001 Thessaloniki, Greece
⁴ Odin Solutions S.L. (ODINS), 30830 Murcia, Spain
⁵ Virtual Vehicle Research GmbH, 8010 Graz, Austria
⁶ Haute Ecole de Gestion de Genève, University of Applied Sciences and Arts Western Switzerland (HES-SO), 1202 Geneva, Switzerland
* Authors to whom correspondence should be addressed.

Future Internet 2026, 18(2), 74; https://doi.org/10.3390/fi18020074

Submission received: 10 December 2025 / Revised: 12 January 2026 / Accepted: 23 January 2026 / Published: 1 February 2026

(This article belongs to the Special Issue Scalable and Distributed Cloud Continuum Orchestration for Next-Generation IoT Applications: Latest Advances and Prospects—2nd Edition)

Abstract

Recently, the need for unified orchestration frameworks that can manage extremely heterogeneous, distributed, and resource-constrained environments has emerged due to the rapid development of cloud, edge, and IoT computing. Kubernetes and other traditional cloud-native orchestration systems are not built to facilitate autonomous, decentralized decision-making across the computing continuum or to seamlessly integrate non-container-native devices. This paper presents the Distributed Adaptive Cloud Continuum Architecture (DACCA), a Kubernetes-native architecture that extends orchestration beyond the data center to encompass edge and Internet of Things infrastructures. Decentralized self-awareness and swarm formation are supported for adaptive and resilient operation, a resource and application abstraction layer is established for uniform resource representation, and a Distributed and Adaptive Resource Optimization (DARO) framework based on multi-agent reinforcement learning is integrated for intelligent scheduling in the proposed architecture. Verifiable identity, access control, and tamper-proof data exchange across heterogeneous domains are further ensured by a zero-trust security framework based on distributed ledger technology. When combined, these elements enable increasingly autonomous workload orchestration, trading centralized control for adaptive, decentralized operation with enhanced interoperability, scalability, and trust. Thus, the proposed architecture enables self-managing and context-aware orchestration systems that support next-generation AI-driven distributed applications across the entire computing continuum.

Keywords:

distributed orchestration; computing continuum; cloud–edge–IoT continuum; kubernetes-native architecture; multi-agent reinforcement learning; adaptive resource optimization; zero-trust security; distributed ledger technology

1. Introduction

Container orchestration engines facilitate cloud services by automating the deployment, scaling, and management of containerized applications. Kubernetes [1] is the predominant engine, offered as a managed service by major cloud providers, including Google Kubernetes Engine (GKE), Amazon Elastic Kubernetes Service (EKS), and Azure Kubernetes Service (AKS). Other engines, such as Docker Swarm [2], OpenShift [3], and Mesos [4], provide alternative features and deployment models. Across all these platforms, orchestration enables containerized applications to be scheduled on available node resources and to scale according to workload fluctuations [5].

The rapid development of cloud, edge, and Internet of Things (IoT) computing has created a strong need for unified orchestration frameworks that can manage extremely heterogeneous, distributed, and resource-constrained environments. Traditional cloud-native orchestration systems present significant limitations, as they are not inherently built to facilitate autonomous, decentralized decision-making throughout the computing continuum or to seamlessly integrate non-container-native devices [6]. Moving from cloud-centric clusters to this diverse continuum introduces several critical challenges [7]: fast scaling and low-latency processing for many workloads remain difficult when resources are highly distributed; the efficient use of heterogeneous and sometimes constrained hardware, particularly in edge environments, is an active research problem; and finally, achieving interoperability between various protocols [8] and device types requires suitable abstractions for networking, communication, and resource representation.

To address these limitations, the Distributed Adaptive Cloud Continuum Architecture (DACCA) is introduced as a novel Kubernetes-native framework designed to extend orchestration capabilities beyond the data center to encompass the entire continuum of highly heterogeneous infrastructures. DACCA shifts from centralized management to decentralized, intelligent operation by integrating decentralized self-awareness and swarm formation. This capability enables adaptive and resilient operation, supporting fast scaling and low-latency processing where infrastructure conditions permit, while acknowledging the trade-offs introduced by decentralization and heterogeneous resource constraints. The architecture also establishes a resource and application abstraction layer to ensure uniform resource representation and resolve interoperability challenges, enabling seamless integration of constrained hardware. Furthermore, the Distributed and Adaptive Resource Optimization (DARO) framework, which utilizes multi-agent reinforcement learning (MARL), ensures intelligent scheduling and optimized resource utilization. Finally, to guarantee verifiable identity, access control, and tamper-proof data exchange across heterogeneous domains, DACCA is reinforced by a zero-trust security framework based on distributed ledger technology (DLT).

The primary contribution of this paper is the design and formalization of a decentralized, self-aware orchestration model for the cloud–edge–IoT continuum, implemented as a Kubernetes-native architecture. DACCA is centered on the insight that autonomous continuum-scale operations cannot be achieved through isolated solutions but instead require the tight integration of decentralized awareness, adaptive coordination, and learning-driven decision-making under heterogeneity and partial observability. To realize this model, DACCA integrates four key, interdependent architectural elements: decentralized self-awareness at the node level; dynamic swarm formation of nodes for adaptive, application-specific execution; distributed learning-based scheduling under partial observability (via the DARO framework); and a zero-trust security foundation that spans diverse administrative and operational boundaries (via the DLT framework). Although existing systems address isolated aspects of this pipeline, such as basic monitoring, device onboarding, centralized scheduling, or siloed security, they fail to provide an integrated, holistic framework in which nodes and applications can collaborate, adapt, and optimize their operations collectively across the entire continuum. DACCA supports this necessary integrated view, offering a concrete architectural blueprint toward increasingly autonomous and self-managing cloud–edge–IoT platforms.

The remainder of the paper is organized as follows. Section 2 provides background information and related work. Section 3 presents an overview of the proposed architecture. Section 4 discusses the model of resources and application data, which are then used for discovery, management, and interoperability in Section 5. Section 6 introduces the mechanisms that equip nodes with self-awareness and autonomous decision-making capabilities. Section 7 presents the Distributed and Adaptive Resource Optimization (DARO) framework. Section 8 describes the decentralized trust and security framework. Section 9 outlines further considerations and future work, while Section 10 concludes the article.

2. Background and Related Work

2.1. Computing Continuum Orchestration

Containerization and virtualization are core technologies that enable the abstraction, packaging, and deployment of applications in isolated environments. Many orchestration platforms and container engines exist to execute isolated applications and deploy them in various environments. Additionally, containerization supports portability, scalability, and efficient management of workloads throughout the computing continuum.

We studied and compared existing orchestration and management tools to select the most suitable platform to build on. The comparison criteria were out-of-the-box capabilities, scalability options, flexibility, degree of decentralization, intended operational scope, and community support. We preferred open-source solutions but also considered mainstream proprietary engines. We opted to use Kubernetes as the basis. The following paragraphs give an overview of some of the most popular container orchestration engines and compare each of them with Kubernetes, highlighting similarities and differences.

Kubernetes (K8s) [1] is an open-source container orchestration system that automates software deployment, scaling, and management. Originally designed by Google, the project is now maintained by a worldwide community of contributors, and the trademark is held by the Cloud Native Computing Foundation. Kubernetes provides a modular microservice-based architecture with rich scalability options, including autoscalers that are available out of the box. It is a highly flexible and customizable engine with broad community support.

Parallel to Kubernetes is Amazon EKS (Amazon Elastic Kubernetes Service) [9]. EKS provides seamless integration with AWS services, comes with managed updates, scaling, and self-healing features, and has a large and active community. We did not proceed with EKS, as it is more complex to set up during its initial phase compared to Kubernetes, requires a good understanding of AWS (Amazon Web Services), and is costly for large-scale deployments.

Similarly to EKS, we looked into AKS (Azure Kubernetes Service) [10], which natively integrates with Azure services, provides managed updates, scaling, and self-healing features, comes with very strong security mechanisms via Azure AD (Microsoft Entra ID), and also has an active community of developers. The cons are the same as in EKS: the complexity of setting up, the required deep knowledge of Azure infrastructure, and the high cost for large-scale deployments, all of which hinder the adoption of this option.

Another option we considered was Nomad [11]. Nomad is an orchestration engine maintained by HashiCorp that manages containerized applications within a single workflow. Although it is very simple to use, a key difference with Kubernetes is that Nomad focuses primarily on cluster management and scheduling. Kubernetes, on the other hand, provides a sophisticated ecosystem for managing clusters, in addition to deploying applications, including service discovery, monitoring, and other resource management capabilities.

OpenStack [12] is another alternative, maintained by the OpenStack Foundation. OpenStack offers support not only for containerized applications but also for virtual machines and bare metal management. However, it is also complex to use and maintain, especially in relation to Kubernetes or Nomad.

OpenShift [3] by Red Hat is also a known engine, providing advanced DevOps and CI/CD tools. Strong security and enterprise features are available, promoting hybrid cloud environments supported by a large and active community. However, it is a proprietary solution that requires knowledge of Red Hat infrastructure and is complex to set up and manage.

Another option we heavily considered was KubeEdge [13]. This orchestration engine is based on Kubernetes and is designed to support edge computing. KubeEdge’s architecture relies on device controllers that enable edge devices to communicate with the cloud. A key limitation is that KubeEdge is only intended to integrate the cloud with diverse edge devices and does not extend to deploying non-containerized applications to edge and IoT devices. Achieving this would require redesigning the device controllers to support such modular behavior. This reflects a design assumption in which edge resources remain cloud-managed extensions rather than autonomous participants in a decentralized orchestration continuum.

Finally, we also considered Docker Swarm [2], which provides extensive tools and ecosystem support, is simple to use, integrates with Docker Hub, and has a vast active community for support. However, it is much less popular than Kubernetes and requires additional setup for orchestration functionalities, and its security and scalability features raise significant concerns for large-scale deployments.

After conducting an exhaustive study and comparison, we opted to use Kubernetes as the base orchestration engine and extend its capabilities to support more heterogeneous and multimodal devices as nodes. In addition, we will demonstrate how our proposed architecture further enhances security and trust, as well as the monitoring of nodes and ease of cluster management. The comparison focuses on architectural scope and design assumptions rather than absolute performance, as latency, energy efficiency, and scheduling optimality depend strongly on deployment context and workload characteristics. A structured comparison of DACCA and representative orchestration platforms is provided in Table 1.

It is important to mention that none of the above orchestration engines offers a straightforward, out-of-the-box feature for deploying workloads to diverse edge devices (including smartphones and Android OS-based devices in general). The existing scheduling components rely solely on CPU, memory, and GPU utilization (as seen in Kubernetes and also reflected in AWS pricing policies) and thus do not provide a more sophisticated scheduler that takes additional aspects into account. The ultimate aim is to support diverse applications on a cloud infrastructure containing multi-modal resources and devices, while taking security and trust into account. Our proposed architecture builds on Kubernetes, the most popular open-source solution, and extends it to address these issues.

2.2. Decentralized Self-Awareness

Decentralized self-awareness is a foundational concept in autonomic and self-managing systems, enabling computing entities to monitor, analyze, and reason about their own operational state without relying on centralized control mechanisms [14,15]. In large-scale cloud–edge–IoT continuum environments, full global observability is often infeasible due to scale, heterogeneity, and dynamic resource conditions, making purely centralized monitoring and decision-making approaches difficult to sustain [16].

To address these challenges, prior work has explored decentralized and agent-based self-aware mechanisms, where individual nodes maintain local or partial views of the system state and adapt their behavior accordingly [17]. Such approaches typically rely on lightweight monitoring, aggregation, or information exchange schemes, often based on gossiping or partial view dissemination to balance scalability and responsiveness under partial observability [18]. These mechanisms have been shown to improve robustness and scalability compared to centralized alternatives, particularly in highly distributed environments.
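As an illustration of the gossip-style aggregation mentioned above, the following minimal sketch lets each node refine a local estimate of the average CPU load by repeatedly averaging with one random peer. It is not taken from the cited works or from DACCA; the function name `gossip_average`, the fully connected overlay, and the sample load values are assumptions made only for this example.

```python
import random


def gossip_average(values, rounds, seed=0):
    """Pairwise gossip averaging: two random nodes exchange and average state.

    Assumes a fully connected overlay (any node can contact any other),
    which is a simplification chosen for illustration.
    """
    rng = random.Random(seed)  # seeded so the run is reproducible
    state = list(values)
    for _ in range(rounds):
        i, j = rng.sample(range(len(state)), 2)  # pick two distinct nodes
        mean = (state[i] + state[j]) / 2.0
        state[i] = state[j] = mean  # both nodes adopt the pairwise mean
    return state


# Hypothetical per-node CPU load (fraction of capacity).
cpu_load = [0.10, 0.90, 0.40, 0.75, 0.20, 0.55, 0.95, 0.35]
estimates = gossip_average(cpu_load, rounds=200)
```

Each exchange preserves the sum of all values, so every local estimate drifts toward the global mean without any node ever holding the full system state.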

Despite these advances, existing decentralized self-awareness solutions are often tightly coupled to specific domains or assume limited interaction with higher-level orchestration mechanisms. In cloud–edge settings, self-awareness is frequently treated as an auxiliary monitoring function rather than a first-class architectural component that directly informs resource orchestration and adaptation decisions [19]. This gap motivates architectures that explicitly integrate decentralized self-awareness with orchestration and optimization layers, as pursued in the proposed architecture.

2.3. Edge-Aware and Distributed Scheduling

Edge computing environments introduce significant challenges for task scheduling due to heterogeneous resource capabilities, highly dynamic workloads, and stringent latency and energy constraints. While Kubernetes has become the de facto orchestration platform across cloud and edge deployments, its default scheduling mechanisms are largely agnostic to network conditions, energy efficiency, and user mobility. This limitation has motivated a growing body of work on edge-aware and distributed scheduling extensions. Recent heuristic-based approaches focus on enhancing Kubernetes with lightweight, adaptive mechanisms that improve responsiveness without incurring the overhead of complex optimization. For example, Tsokov and Kostadinov propose Kinitos, a network-aware scheduler and descheduler that continuously migrates microservice pods across edge nodes to satisfy latency objectives under node mobility and fluctuating network conditions, achieving substantial reductions in end-to-end response time compared to the default Kubernetes scheduler [20]. Similarly, Ali et al. introduce the EdgeBus framework, which incorporates a mobility-aware node migration orchestrator to heuristically balance latency reduction against migration overhead in heterogeneous mobile edge environments [21]. Beyond latency-focused heuristics, hybrid optimization approaches have been explored to jointly address performance and energy efficiency. By combining classical placement heuristics with evolutionary optimization techniques, such methods improve resource consolidation and reduce energy consumption while maintaining acceptable task completion times in large-scale edge clusters [22].

In parallel, learning-based scheduling approaches have gained increasing attention as a means to cope with the non-stationary and partially observable nature of cloud–edge systems. Deep reinforcement learning (DRL) techniques, in particular, enable schedulers to continuously adapt placement decisions based on real-time system feedback. Wang et al. integrate a proximal policy optimization (PPO) agent into Kubernetes to minimize task response time in edge clusters, demonstrating significant latency improvements under high load conditions [23]. Other works extend this paradigm toward multi-objective optimization, explicitly balancing latency, energy consumption, and fairness. The Smart-Kube scheduler employs DRL to achieve energy-aware and fair job placement, reducing power usage while preserving performance isolation across workloads [24]. More generally, reinforcement learning-based schedulers model the placement problem as a sequential decision process, enabling the joint optimization of competing objectives such as execution delay, energy efficiency, and operational cost in heterogeneous cloud–edge infrastructures [25]. These approaches consistently outperform static heuristics in dynamic environments, though this comes with increased training complexity and requires carefully crafted reward functions.
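To make the notion of a multi-objective reward more concrete, the sketch below scores a placement outcome as a negative weighted sum of normalized latency, energy, and cost. It does not reproduce the reward of any cited scheduler; the class `PlacementOutcome`, the weights, and the normalization scales are hypothetical values chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PlacementOutcome:
    """Observed result of placing one task (hypothetical fields)."""
    latency_ms: float
    energy_j: float
    cost: float


def reward(outcome, weights=(0.5, 0.3, 0.2), scales=(100.0, 50.0, 1.0)):
    """Negative weighted sum of normalized objectives (higher is better).

    The weights and scales are assumptions; in practice they would have to
    be tuned per deployment.
    """
    terms = (outcome.latency_ms, outcome.energy_j, outcome.cost)
    return -sum(w * (t / s) for w, t, s in zip(weights, terms, scales))


edge = PlacementOutcome(latency_ms=20.0, energy_j=40.0, cost=0.8)
cloud = PlacementOutcome(latency_ms=120.0, energy_j=25.0, cost=0.4)
better = "edge" if reward(edge) > reward(cloud) else "cloud"
```

With these illustrative weights, the low-latency edge placement scores higher even though it uses more energy, which shows how the choice of weights directly shapes the learned policy.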

2.4. Decentralized Trust and Security Mechanisms

Trust and security management remain open challenges in heterogeneous and distributed environments due to their inherent decentralization, heterogeneity, and multi-domain nature. Traditional security solutions were largely designed for centralized cloud infrastructures and typically rely on trusted authorities for identity management, policy definition, and access enforcement. While such approaches are effective in single-domain settings, they introduce single points of failure and scale poorly when extended to federated or highly dynamic edge deployments [26].

To mitigate these limitations, several works have proposed decentralized and federated security models for distributed systems. Early efforts focus on extending public key infrastructures, federated identity management, and role-based access control mechanisms to edge and IoT scenarios. However, these solutions often assume stable connectivity and a limited number of trusted administrative domains, making them difficult to apply in large-scale, cross-organizational computing continuums.

More recently, zero-trust security paradigms have gained attention as a means to improve resilience in distributed environments. Zero-trust approaches eliminate implicit trust based on network location and instead require continuous verification of identity, context, and authorization for every access request. Academic and industrial solutions adopting zero-trust principles typically employ context-aware access control, secure gateways, and continuous authentication. While these approaches improve protection against insider threats and lateral movement, most existing zero-trust implementations still depend on centralized identity providers and policy repositories, which limits transparency, auditability, and cross-domain consistency [27].

Alongside zero-trust developments, decentralized identity and access management solutions have emerged, motivated by the need for stronger autonomy and interoperability. Self-sovereign identity models and verifiable credentials aim to give entities greater control over their identities without relying on centralized authorities. In edge and IoT contexts, such approaches have been shown to improve scalability and reduce trust assumptions. Nevertheless, many existing proposals focus primarily on identity representation and authentication, leaving policy management, enforcement consistency, and auditability as open challenges.

Distributed Ledger Technology (DLT) has been explored as a foundational mechanism to address these issues by providing immutable, tamper-resistant storage for identities, credentials, and access control policies. Prior research demonstrates that DLT-based security frameworks can enhance transparency, enable decentralized trust establishment, and support verifiable audit trails across administrative domains. However, existing solutions often remain application-specific, introduce significant overhead, or lack tight integration with orchestration and resource management layers in cloud–edge–IoT systems [28].

Overall, while substantial progress has been made toward decentralized and zero-trust security models, current approaches typically address isolated aspects of trust management—such as identity, authorization, or auditing—without offering an integrated framework that aligns security enforcement with orchestration and scheduling across the computing continuum. This gap motivates the need for security architectures that combine decentralized trust, continuous verification, and orchestration-aware enforcement in a unified and scalable manner.

3. Architecture Overview

This paper presents the architecture of a distributed and adaptive cloud computing continuum that aims to extend and enhance existing cloud orchestration concepts, ideas, and approaches. Our proposed architecture is guided by the principles of modularity, scalability, and self-management. The ultimate goal is the seamless execution of data-intensive applications utilizing the full spectrum of the computing continuum, spanning from resource-constrained IoT devices at the edge to high-performance cloud infrastructures. The proposed architecture follows a layered, service-oriented approach, abstracting resource heterogeneity and promoting interoperability of interfaces and resource abstractions.

Our proposed architecture aims to offer a computing continuum landscape, achieving the seamless integration of diverse computational platforms:

IoT Devices and On-Device Nodes: Devices such as sensors, AR glasses, and microcontrollers are integrated via lightweight communication protocols and connectors that translate their capabilities into Kubernetes-compatible descriptors.
Edge Infrastructure: Near- and far-edge nodes are equipped with enhanced awareness and resource monitoring capabilities, allowing them to participate in autonomous cluster formation and workload execution.
Cloud Infrastructure: High-performance centralized resources provide additional computing power and storage capacity for tasks requiring global state, deep analytics, or large-scale inference.
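The idea of a connector that translates device capabilities into Kubernetes-compatible descriptors can be sketched as follows. This is only an assumed shape, not the DACCA connector itself: the function `to_node_descriptor`, the input field names, and the label prefix `continuum.example/` are hypothetical, while `apiVersion`, `kind`, `metadata.labels`, and `status.capacity` follow the standard Kubernetes Node object layout.

```python
def to_node_descriptor(device):
    """Map a device's self-reported capabilities to a Node-shaped dict."""
    prefix = "continuum.example/"  # hypothetical label domain
    labels = {
        prefix + "tier": device["tier"],
        prefix + "container-native": str(device["container_native"]).lower(),
    }
    # Expose each sensor as its own label so workloads can select on it.
    for sensor in device.get("sensors", []):
        labels[prefix + "sensor-" + sensor] = "true"
    return {
        "apiVersion": "v1",
        "kind": "Node",
        "metadata": {"name": device["id"], "labels": labels},
        "status": {
            "capacity": {
                "cpu": str(device["cpu_cores"]),
                "memory": f"{device['memory_mib']}Mi",
            }
        },
    }


# Hypothetical AR-glasses device that cannot run containers natively.
node = to_node_descriptor({
    "id": "ar-glasses-01",
    "tier": "iot",
    "container_native": False,
    "cpu_cores": 2,
    "memory_mib": 512,
    "sensors": ["camera", "imu"],
})
```

Representing constrained devices in the same shape as regular nodes lets label-based selection work uniformly across IoT, edge, and cloud resources.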

3.1. Kubernetes as a Basis

In our approach, we choose Kubernetes as the most suitable basis on which to build our architecture, as it enables workload orchestration, scalability, and self-management of nodes. We justify this selection with the following characteristics of Kubernetes:

1. Orchestration of Heterogeneous Computing Nodes: Kubernetes contains flexible scheduling and networking policies that can enable heterogeneous nodes (i.e., nodes with very different resources) to participate in a unified ecosystem. Our architecture will extend these capabilities to integrate more types of computing nodes, such as Android smartphones, Raspberry Pi devices, or other single-board computers.
2. Dynamic Resource Allocation: Kubernetes provides a scheduler for resource management that assigns workloads and pods to appropriate nodes, taking into account constraints and enforcing policies. Decisions are based on resource availability, latency constraints, custom policies, etc., allowing for dynamic reallocation of workloads, which optimizes the execution of compute-intensive tasks in heterogeneous environments.
3. Scalability: An elastic infrastructure capable of dynamically scaling applications is a strong requirement of our architecture. Accommodating workloads on heterogeneous nodes requires horizontal and vertical autoscalers based on predefined policies, optimizing resource utilization across the entire computing continuum.
4. Advertising and Node Awareness: Nodes can announce their availability, capabilities, and resource constraints throughout the entire computing continuum. This ensures that processing workloads are assigned to the most suitable nodes, taking into account resource availability, latency, and energy efficiency. Kubernetes’ built-in mechanisms (node labels, taints, tolerations, and Custom Resource Definitions) effectively allow for the required node discovery and workload provisioning.
5. Resilience and Self-Healing: Kubernetes implements controllers for its self-healing mechanisms that automatically detect and recover from failures, ensuring high availability of applications.
6. Cloud-to-Edge Compatibility: Kubernetes ensures integration between cloud, edge, and IoT nodes through its ability to manage deployments across different environments. These deployments can be extended to heterogeneous environments through specialized configurations and extensions.
7. Security and Access Control: Security is the cornerstone of our proposed decentralized architecture. Kubernetes offers built-in Role-Based Access Control (RBAC), network policies, and encryption mechanisms out of the box, ensuring data integrity and preventing unauthorized access across the computing continuum.
8. Service Discovery and Load Balancing: Kubernetes provides a robust service discovery and load balancing framework, ensuring efficient inter-service communication across distributed computing nodes.
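The node labels, taints, and tolerations listed above can be illustrated with a simplified eligibility check. The sketch below mirrors the documented matching rules in reduced form (it ignores `PreferNoSchedule`, node affinity, and resource fit) and is not the Kubernetes scheduler code; the label `continuum.example/tier` and the taint key `battery-powered` are hypothetical.

```python
def tolerates(toleration, taint):
    """Return True if one toleration matches one taint (simplified)."""
    # An empty effect on the toleration matches every taint effect.
    if toleration.get("effect") and toleration["effect"] != taint["effect"]:
        return False
    operator = toleration.get("operator", "Equal")
    if operator == "Exists":
        # An empty key with Exists tolerates every taint.
        return not toleration.get("key") or toleration["key"] == taint["key"]
    return (toleration.get("key") == taint["key"]
            and toleration.get("value", "") == taint.get("value", ""))


def is_schedulable(pod, node):
    """Check nodeSelector labels and hard taints for a pod/node pair."""
    labels = node.get("labels", {})
    selector_ok = all(labels.get(k) == v
                      for k, v in pod.get("nodeSelector", {}).items())
    # NoSchedule and NoExecute both block placement of non-tolerating pods.
    blocking = [t for t in node.get("taints", [])
                if t["effect"] in ("NoSchedule", "NoExecute")]
    taints_ok = all(any(tolerates(tol, t) for tol in pod.get("tolerations", []))
                    for t in blocking)
    return selector_ok and taints_ok


taint = {"key": "battery-powered", "value": "true", "effect": "NoSchedule"}
node = {"labels": {"continuum.example/tier": "edge"}, "taints": [taint]}
pod = {
    "nodeSelector": {"continuum.example/tier": "edge"},
    "tolerations": [dict(taint, operator="Equal")],
}
plain_pod = {"nodeSelector": {"continuum.example/tier": "edge"}}
```

Here `pod` is eligible for the battery-powered edge node because it explicitly tolerates the taint, whereas `plain_pod` selects the same tier but is kept off the node.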

3.2. Scheduling and Scalability

A significant advantage of using Kubernetes in our architecture is that it already provides sophisticated mechanisms and core components for workload distribution, scalability, and overall adaptability. The Kubernetes Scheduler assigns pods to nodes based on resource requests, constraints, and policies, and it can be extended with plugins or custom schedulers so that placement also considers latency requirements and energy constraints. In this way, application execution can be made efficient in cloud-to-edge environments.

Kubernetes also provides the concept of autoscalers, packaged into the following mechanisms:

Cluster Autoscaler (CA): the required number of nodes can be dynamically adjusted, based on demand.
Horizontal Pod Autoscaler (HPA): the number of pod replicas is automatically adjusted based on CPU and memory utilization.
Vertical Pod Autoscaler (VPA): the resource limits are adjusted for each individual pod for optimal resource reallocation while maintaining application performance.
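As a worked example of the Horizontal Pod Autoscaler, the sketch below applies the replica formula given in the Kubernetes documentation, desiredReplicas = ceil(currentReplicas × currentMetricValue / desiredMetricValue), together with the default 10% tolerance band. It is a simplified illustration: the function name `desired_replicas` is hypothetical, and the real controller additionally clamps the result to the configured minimum and maximum and applies stabilization windows.

```python
import math


def desired_replicas(current_replicas, current_metric, target_metric,
                     tolerance=0.1):
    """Simplified HPA replica calculation for a single metric."""
    ratio = current_metric / target_metric
    # Skip scaling when usage is within the tolerance band around the target.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    return math.ceil(current_replicas * ratio)


scale_up = desired_replicas(4, current_metric=90, target_metric=60)
no_change = desired_replicas(4, current_metric=63, target_metric=60)
scale_down = desired_replicas(4, current_metric=30, target_metric=60)
```

With four replicas and a 60% CPU target, 90% utilization yields six replicas, 63% stays at four because it falls inside the tolerance band, and 30% yields two.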

Kubernetes can also isolate workloads logically using namespaces. This allows the architecture to accommodate and manage diverse applications and workloads while maintaining security and isolation across execution environments.

3.3. Top-Level Architecture

This section outlines the overarching architecture of the proposed DACCA framework, leveraging the fundamental capabilities of Kubernetes along with its scalability and scheduling mechanisms. The architecture is structured in layers employing a bottom-up design that enhances modularity, interoperability, and extensibility throughout the computing continuum. Each layer encompasses distinct functionalities, including device discovery, resource abstraction, decentralized decision-making, adaptive scheduling, and trust enforcement, creating a cohesive ecosystem that connects heterogeneous infrastructures. This overview presents the primary architectural layers and demonstrates how their coordinated functioning facilitates seamless orchestration and intelligent workload management across diverse environments.

Our proposed architecture, visualized in Figure 1, consists of the following discrete layers. The design follows a bottom-up approach, and the layers are listed here from top to bottom, starting with the Applications layer.

1. The Applications layer, which consists of the various applications and workloads that are submitted to our platform.
2. The Decentralized Trust and Security layer, containing a security framework designed to implement a fully decentralized trust and security model combining Distributed Ledger Technology (DLT) with the principles of Zero-Trust.
3. The Distributed and Adaptive Resource Optimization (DARO) scheduler, a Kubernetes-native scheduling system designed to support real-time resource management and task scheduling in large-scale heterogeneous environments. DARO is designed to operate effectively in resource-constrained, non-stationary cloud–edge–IoT environments.
4. The Decentralized Self-Awareness and Swarm Formation layer, which introduces mechanisms equipping nodes with self-awareness and autonomous decision-making capabilities through decentralized monitoring and dynamic swarm formation.
5. The Node Discovery, Management, and Interoperability layer, which offers critical resource abstraction mechanisms for the discovery, management, and interoperability of nodes.

These layers create a unified and modular framework that facilitates seamless interoperability and intelligent orchestration throughout the computing continuum. The foundational lower layers, which facilitate node discovery, management, and interoperability, abstract device heterogeneity, enabling both container-native and non-container-native nodes to uniformly engage in the cluster. The decentralized self-awareness and swarm formation layer facilitates adaptive intelligence, enabling nodes to independently assess their status, cooperate, and dynamically organize into logical clusters based on workload demands. The distributed and adaptive resource scheduler enhances this foundation by facilitating efficient workload allocation via decentralized, learning-based decision-making, thereby aiming to improve resource utilization and scalability across various environments. Lastly, the decentralized trust and security layer ensures that all system interactions are secure and verifiable by integrating distributed identity management, zero-trust principles, and tamper-proof auditability. When combined, these layers allow DACCA to fulfill its main objective, which is to provide a reliable, flexible, and robust orchestration framework for hyper-distributed heterogeneous systems.

4. Resource and Application Abstraction

4.1. Node Data Model

In this section, we describe the data models developed for implementing the resource abstraction and application layers. Cluster nodes, divided into physical and virtual machines, edge devices, and IoT devices, are represented in the data models. The application model represents the application execution framework necessary for executing applications within the proposed architecture. An overview of the proposed DACCA node data model is illustrated in Figure 2.

The data models adopt traditional Kubernetes core cluster resources, configuration details, status, and conditions for cluster operations. A node has the following default sections:

The metadata (ObjectMeta), which is the metadata section associated with a node and describes identifying attributes for querying, monitoring, and managing each node (e.g., name, labels, creation timestamp).
The spec (NodeSpec), describing how Kubernetes should treat a node in terms of scheduling, networking, and provider information, and depicting the desired state of each node (e.g., scheduling policies, provider details).
The status (NodeStatus), which is the current actual status of a node. Its attributes provide real-time information about the resource capacity, connectivity, and health of a node (e.g., capacity, conditions, addresses). The following tables provide examples of node conditions (Table 2) and node addresses (Table 3).

In our proposed architecture, we make use of the existing Kubernetes node data model and extend it by considering a categorization of a node using two abstraction entities: the Native Node and the Device Node. In our model, the Native Nodes constitute the traditional Kubernetes worker nodes capable of running containerized applications (i.e., pods). The Device Nodes refer to computational resources that cannot become Kubernetes nodes natively (i.e., either they are not capable of executing containers or do not support the minimum computational requirements).

4.1.1. Native Nodes Model

As already presented, the Native Nodes utilize the traditional Kubernetes node objects, which include:

Metadata (labels, annotations, name, etc.),
Spec (pod CIDR, taints, etc.),
Status (capacity, allocatable resources, conditions, addresses, etc.).

However, this existing model focuses mainly on CPU, memory, ephemeral storage, and GPU utilization [29]. The Native Node model expands the Kubernetes node model by introducing additional attributes that describe energy costs, sensors attached to nodes, network interfaces, and other properties, which result from including a broader range of devices in the cluster. The Native Node model uses additional node fields, labels, annotations, and taints to represent additional node properties (see Figure 3).

4.1.2. Device Nodes Model

Device Nodes, which represent edge devices, cannot participate directly in the Kubernetes cluster the way Native Nodes can. Therefore, we use Custom Resource Definitions (CRDs) [30] in the Kubernetes Control Plane to register the Device Nodes. To achieve a proper structure, we adhere to the standard Kubernetes metadata, spec, and status patterns.
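As a rough illustration of the metadata, spec, and status pattern, a Device Node custom resource could be assembled as in the sketch below. The API group `example.dacca.io`, the version string, and every field name under `spec` and `status` are hypothetical placeholders chosen for this example; the actual CRD schema is not given in the text.

```python
import json


def make_device_node(name, os_name, cpu_cores, memory_mb, sensors):
    """Build a Device Node custom resource as a plain dict.

    The apiVersion and all spec/status field names are illustrative
    assumptions, not the actual DACCA CRD schema.
    """
    return {
        "apiVersion": "example.dacca.io/v1alpha1",  # hypothetical group/version
        "kind": "DeviceNode",
        "metadata": {"name": name, "labels": {"node-type": "device"}},
        # Declared properties of the device.
        "spec": {
            "os": os_name,
            "resources": {"cpuCores": cpu_cores, "memoryMB": memory_mb},
            "sensors": list(sensors),
        },
        # Observed state, expected to be patched later by a controller.
        "status": {"available": False, "lastSeen": None},
    }


device = make_device_node("rpi-01", "linux", 4, 4096, ["camera", "gps"])
print(json.dumps(device, indent=2))
```

Keeping the same three top-level sections as a built-in Node object means generic Kubernetes tooling can list and describe Device Nodes without special handling.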

4.2. Application Profile Model

Deploying applications across heterogeneous and distributed environments requires a clear and consistent way of describing what an application needs in order to run correctly. The Application Profile Model (APM) addresses this requirement by providing a structured, machine-readable contract that captures the key properties of an application: its operational behavior, resource requirements, constraints, and execution context. These contracts, which we refer to as Application Profiles, ensure that all components involved in the deployment process interpret application requirements in the same way.

Application Profiles are built directly on the Application Data Model and offer a practical mechanism for expressing all relevant attributes in a predictable and schema-driven format. As shown in Figure 4, the APM defines a clear structure that describes runtime settings, compute and accelerator needs, networking dependencies, constraints on placement, and Quality-of-Service (QoS) expectations. This structured representation makes it possible to validate, store, and reuse profiles across different platforms and deployment scenarios.

The APM’s core objectives are:

Standardizing and validating application metadata: Profiles are written in YAML for readability but are validated against versioned JSON Schemas. This approach ensures that they follow a well-defined structure and contain all necessary fields, reducing the likelihood of errors when the profiles are processed by orchestration tools.
Supporting integration with orchestrators [31]: The model is compatible with a wide range of applications, from Kubernetes-native workloads to device-specific and multi-service integration workflows. It supports both containerized and non-containerized execution, enabling a common representation across different technologies and environments.
Ensuring applicability across the computing continuum [32]: The profile model is intentionally broad so it can describe applications intended for cloud servers, edge devices, or IoT endpoints. This allows an application to be defined once and then deployed wherever the required resources and conditions are met.

Each Application Profile is organized into three main sections, following the same structure used for the Node Data Model in Section 4.1:

1. Metadata: Contains identifiers, annotations, and descriptive fields consistent with Kubernetes-style metadata conventions. This information supports indexing, categorization, and traceability across tools and platforms.
2. Specs: Describes the desired state of the application, including runtime configuration and resource requirements such as CPU, memory, GPU, and accelerators, as well as networking needs, deployment constraints, sensor dependencies, and QoS expectations. This section provides orchestrators with all the information needed to make informed scheduling and placement decisions.
3. Status: Represents the observed state during execution to ensure completeness and support monitoring and lifecycle management.
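The three-section structure lends itself to simple structural validation. The sketch below uses hand-written checks from the standard library as a stand-in for the versioned JSON Schema validation described above. The key names (`spec`, `runtime`, `resources`, `qos`, `cpu`) and the example values are assumptions made for illustration only.

```python
REQUIRED_SECTIONS = ("metadata", "spec", "status")
REQUIRED_SPEC_FIELDS = ("runtime", "resources", "qos")  # hypothetical field names


def validate_profile(profile):
    """Return a list of problems; an empty list means the profile passed."""
    errors = []
    for section in REQUIRED_SECTIONS:
        if not isinstance(profile.get(section), dict):
            errors.append(f"missing or non-object section: {section}")
    spec = profile.get("spec")
    if isinstance(spec, dict):
        for field in REQUIRED_SPEC_FIELDS:
            if field not in spec:
                errors.append(f"spec.{field} is required")
        resources = spec.get("resources")
        if isinstance(resources, dict) and "cpu" in resources:
            cpu = resources["cpu"]
            if not isinstance(cpu, (int, float)) or cpu <= 0:
                errors.append("spec.resources.cpu must be a positive number")
    return errors


good_profile = {
    "metadata": {"name": "video-analytics"},
    "spec": {
        "runtime": "container",
        "resources": {"cpu": 2, "memoryMB": 512},
        "qos": {"maxLatencyMs": 100},
    },
    "status": {},
}
bad_profile = {
    "metadata": {"name": "broken"},
    "spec": {"runtime": "container", "resources": {"cpu": 0}},
}

print(validate_profile(good_profile))
print(validate_profile(bad_profile))
```

Returning a list of messages rather than raising on the first failure lets a submitting tool report every problem in a profile at once.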

5. Node Discovery, Management, and Interoperability

This section describes the resource abstraction mechanisms required for the discovery, management, and interoperability of nodes. These mechanisms implement the functionality behind the resource and application abstraction layer described in the previous section. They cover the discovery of nodes, their management, and the functionality required to achieve smooth provisioning of applications and interoperability.

5.1. Mechanisms to Register, Discover, and Manage Heterogeneous Nodes

A self-advertisement mechanism is implemented to enable the registration, discovery, and management of heterogeneous nodes. Every new node must be seamlessly, robustly, and securely added to the cluster with almost no user intervention. This enables a seamless workload distribution without the need for manual configurations. In the event that a node is unregistered or unexpectedly goes offline (due to an error or for maintenance), workloads are automatically redistributed across the remaining nodes. This leads to uninterrupted operations and overall cluster functionality.

In Kubernetes and other orchestration platforms, adding and managing new nodes is not a straightforward process, despite the availability of partial automation. This stems from the assumption that all nodes are managed by administrators or system operators in stages to ensure reliability. In our proposed architecture, we provide a more user-friendly, robust, and streamlined approach to dynamically adding nodes. Our proposed solution, Hypertool, is a CLI tool developed to automate the discovery and registration of heterogeneous nodes. This tool facilitates all the required steps, from the dynamic creation of the node representation to its registration. Two output modes are provided: the Default, which displays primary steps and results for standard operations, and the Verbose, which provides detailed information about each step with diagnostics and logs for advanced troubleshooting and insights. Hypertool offers a secure, robust, and efficient mechanism for integrating new nodes into the cluster, thus significantly reducing the complexity and manual effort required by users.

5.2. Cognitive Cloud Softwarized Infrastructure (Connectors)

In this section, we describe the implementation and architectural impact of Open Connectors, an abstraction layer on top of the cluster nodes. We describe a software-based infrastructure that aims to interconnect multi-modal entities joining the cluster as nodes. The Open Connectors are implemented to enhance the interoperability of nodes and extend the management of computational, network, and storage resources.

The Open Connectors are implemented to create a mechanism for edge devices to join the cluster as nodes, although they are not capable of natively executing containers. Any computational resource with the crucial limitation mentioned above cannot become a node in any Kubernetes-based cluster. The Open Connectors enable these kinds of devices to be registered as nodes and be available for scheduling applications to them. The implemented abstraction layer aims to extend the capabilities of existing Kubernetes orchestration engines by describing edge devices using familiar Kubernetes annotations.

The Open Connectors manage the self-advertisement and registration of the edge devices as Kubernetes custom resources and maintain the required network communications for monitoring them and delegating applications to them. This mechanism is intended for edge devices that are not capable of natively joining a Kubernetes cluster due to hardware limitations or the inability to execute any container runtime. The devices targeted to be registered as cluster nodes include Android OS-based devices (such as smartphones, AR glasses, and tablets) and Linux OS-based devices (e.g., Raspberry Pi). Edge devices are described in the Kubernetes cluster using Kubernetes Custom Resource Definitions (CRDs). Two entities related to Open Connectors exist: the DeviceNode and the Application, both of which are described in their corresponding CRDs. Each entity has its corresponding controller for managing the CRD and the custom resources submitted.

The Open Connectors architecture is presented in Figure 5 and consists of two main components: a controller and a communication connector. The controller operates in a Kubernetes-like approach [33], maintaining the Custom Resources created for every edge device and for the applications submitted for these devices. The communication connector is used to receive information from the edge devices and to dispatch applications to them. All communication with the edge devices goes through MQTT brokers such as Eclipse Mosquitto [34], as they are lightweight and efficient. The controllers are divided into two categories: the Device Node controller and the Application controller. The Device Node controller creates and maintains Custom Resources, one per Device Node, and the Application controller transforms submitted applications into Custom Resources, managing their life cycle.

The Open Connectors operate as follows: A candidate edge device sends its details, device information, and available resources (hardware capabilities and onboard sensors) to the cluster. The Open Connectors device controller receives and validates this information, formulates a new custom resource, and submits it to the Kubernetes Control Plane via the API [35]. If the custom resource is accepted, it is stored in the etcd database and is considered an available cluster resource (i.e., traditional kubectl commands can obtain the resource and its description). The Device Node controller’s communication connector receives the status of every registered Device Node at an interval of 5 min per device and patches this information into the corresponding custom resource. In case the edge device fails to send its status, the Device Node is considered unavailable.
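The availability rule described above can be sketched as a small in-memory registry. The 5 min reporting interval comes from the text; the grace factor of two intervals before a device is declared unavailable, along with the class and method names, are assumptions introduced for this example.

```python
from dataclasses import dataclass, field

REPORT_INTERVAL_S = 5 * 60  # status interval stated in the text
GRACE_FACTOR = 2            # assumption: tolerate one missed report


@dataclass
class DeviceRegistry:
    """Tracks the last status report of each registered Device Node."""
    last_seen: dict = field(default_factory=dict)  # device UUID -> timestamp (s)

    def register(self, uuid, now):
        self.last_seen[uuid] = now

    def heartbeat(self, uuid, now):
        if uuid not in self.last_seen:
            raise KeyError(f"unknown device {uuid}")
        self.last_seen[uuid] = now

    def is_available(self, uuid, now):
        return now - self.last_seen[uuid] <= REPORT_INTERVAL_S * GRACE_FACTOR

    def unavailable(self, now):
        return sorted(u for u in self.last_seen if not self.is_available(u, now))


registry = DeviceRegistry()
registry.register("dev-a", now=0)
registry.register("dev-b", now=0)
registry.heartbeat("dev-a", now=300)   # dev-a reports on schedule
print(registry.unavailable(now=700))   # dev-b has been silent for 700 s
```

In a real controller, the result of this check would be patched into the status section of the corresponding custom resource rather than kept in memory.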

Similarly, the Application controller intercepts the submission of applications intended to execute on a Device Node. Upon submission of such an application, the controller creates a custom resource containing all application information (edge device OS, application URL, required sensors, location, etc.) and leaves a specific field blank, which is the UUID of a candidate Device Node. The Open Connectors filter the available Device Nodes based on their current resource utilization and hardware and sensor constraints and fill the Device Node UUID attribute in the Application custom resource. This patch in the custom resource triggers the Application controller’s communication connector to create the appropriate payload and send the application details to the selected Device Node. The sequence diagram for Open Connectors is visualized in Figure 6.
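The filtering step can be illustrated with a short sketch. The dictionary keys and the tie-breaking rule (prefer the least-utilized eligible device) are assumptions for this example; the text only states that devices are filtered by utilization, hardware, and sensor constraints before the UUID field is filled.

```python
def eligible(device, app):
    """Check availability, OS, sensor, and free-resource constraints for one device."""
    return (
        device["available"]
        and device["os"] == app["os"]
        and set(app["sensors"]) <= set(device["sensors"])
        and device["free_cpu"] >= app["cpu"]
        and device["free_mem_mb"] >= app["mem_mb"]
    )


def assign_device(app, devices):
    """Fill the blank UUID field with the least-utilized eligible device, if any."""
    candidates = [d for d in devices if eligible(d, app)]
    if candidates:
        best = min(candidates, key=lambda d: (d["utilization"], d["uuid"]))
        app["deviceNodeUUID"] = best["uuid"]
    return app


devices = [
    {"uuid": "d1", "available": True, "os": "android", "sensors": ["camera"],
     "free_cpu": 2, "free_mem_mb": 1024, "utilization": 0.7},
    {"uuid": "d2", "available": True, "os": "android", "sensors": ["camera", "gps"],
     "free_cpu": 4, "free_mem_mb": 2048, "utilization": 0.4},
    {"uuid": "d3", "available": False, "os": "android", "sensors": ["camera", "gps"],
     "free_cpu": 8, "free_mem_mb": 4096, "utilization": 0.1},
]
app = {"os": "android", "sensors": ["camera", "gps"], "cpu": 1, "mem_mb": 512,
       "deviceNodeUUID": None}

print(assign_device(app, devices)["deviceNodeUUID"])
```

Here `d1` lacks the GPS sensor and `d3` is offline, so only `d2` remains eligible. If no device qualifies, the UUID field stays blank and no dispatch is triggered.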

6. Decentralized Self-Awareness and Swarm Formation

In this section, we introduce mechanisms that equip nodes with self-awareness and autonomous decision-making capabilities through decentralized monitoring and dynamic swarm formation. More specifically, we present a layered approach in which each node monitors its own state, curates anomalies, detects alarming events using embedded intelligence, and sends compressed state information to peer nodes. Thus, each node is capable of having a reliable estimation of the state of the system. Finally, building on this foundation, we present a swarm formation mechanism that dynamically groups nodes to meet application requirements, ensuring adaptive execution and resilience across heterogeneous and resource-constrained infrastructures.

6.1. Decentralized Monitoring and Self-Awareness

Creating a system that spans cloud infrastructures, edge settings, and IoT devices in a decentralized manner requires each node within it to be able to autonomously monitor, analyze, and communicate its own status. This level of self-awareness is crucial, as it enables each node to understand its capabilities and how it can contribute to the system. This can be achieved by placing lightweight modules on each node, with the goal of continuously collecting and examining operational data, such as CPU and memory usage, network latency, and energy levels. Thus, each node maintains an up-to-date and detailed view of its own health and capabilities.

To enable self-awareness in each node, the first step is to deploy an Anomaly Curation mechanism, which will allow continuous monitoring of each node’s raw metrics and ensure high-quality, curated information for the entire architecture in a decentralized manner. This module will be responsible for combining various heterogeneous and cross-domain raw metrics from each node, collected from different sensors and devices, into a unified embedded representation.

The Anomaly Curation mechanism comprises three main components: (1) a Metrics/Metadata Logger that actively collects node-level metrics from sources such as system utilities, Kubernetes cAdvisor, and Prometheus Node Exporter at configurable intervals; (2) a Data Curation Module that performs noise filtering, data cleaning, and feature extraction; and (3) an Anomaly Detection Pipeline based on self-supervised learning. The collected metrics are stored as timestamped CSV files within node-local persistent volumes and processed for compatibility with downstream analysis modules.

For the initial Metadata Logger, targeted metrics span multiple domains, including CPU utilization and throttling, disk I/O operations, memory usage (free, active, cached, swap), network packet transmission and errors, system context switches, thermal readings, and cooling device states. The Data Curation Module processes the collected raw metrics, performing data cleaning, filtering, and meaningful feature extraction (e.g., average disk usage over rolling windows, weighted CPU core utilization) while also normalizing each metric over a time period to construct unified embedded representations.
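The feature extraction steps named above can be sketched with the standard library alone. The window size, the core weights, and the choice of min-max scaling are assumptions for illustration; the text does not specify which normalization the Data Curation Module applies.

```python
from collections import deque


def rolling_mean(values, window):
    """Mean over a sliding window; early samples use the shorter prefix."""
    buf, out = deque(maxlen=window), []
    for v in values:
        buf.append(v)
        out.append(sum(buf) / len(buf))
    return out


def weighted_cpu(core_utils, weights):
    """Weighted average of per-core utilization."""
    return sum(u * w for u, w in zip(core_utils, weights)) / sum(weights)


def min_max_normalize(values):
    """Scale a series to [0, 1] over the observed period."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


disk_usage = [40, 42, 47, 90, 45]            # percent, one sample per interval
smoothed = rolling_mean(disk_usage, window=3)
features = min_max_normalize(smoothed)
print(features)
```

Smoothing before normalization keeps a single spike (the 90 above) from dominating the scaled range as much as it would on the raw series.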

Additionally, the Anomaly Detection Pipeline builds upon the curated data described above and enhances the node's self-awareness. It compares real-time measurements with patterns in the historical data, thereby identifying inconsistencies and patterned anomalies. Using AI models specialized in understanding time series patterns [36,37], it generates case-specific annotations that flag whether a node is in a critical or uncommon state. By embedding these annotations directly into the node metrics, nodes can understand their operational state and capabilities, thereby improving their decision-making in other parts of the system.

The Anomaly Detection Pipeline is based on the CARLA (Contrastive Augmentation for self-supervised Learning in multivariate time-series Anomaly detection) self-supervised framework [38]. CARLA classifies time series windows as normal or anomalous based on representations learned during training.

Nevertheless, knowledge of a node’s capabilities is insufficient to understand how it fits into the entire system compared to the other nodes in the cluster. This gives rise to the need for a Full-State Estimation mechanism, so that each node has a sense of the whole system it resides in. Using autoencoder architectures [39], each node encodes its own curated state data and possible annotations coming from the aforementioned mechanisms, resulting in a compact/compressed state representation vector. Then each node sends these embeddings to all other cluster nodes via the default Kubernetes API or a peer-to-peer protocol [40], in order to reduce the communication overhead. After receiving this information, the nodes reconstruct the initial state vector of their peers through a decoder. This mechanism ensures that each node has a comprehensive view of the whole system without violating bandwidth constraints.

6.2. Conditional Autoencoder for State Estimation

State compression is implemented using a conditional autoencoder, which explicitly accounts for node heterogeneity. Each node encodes its curated state vector, comprising resource and performance indicators such as CPU, memory, storage, bandwidth, latency, packet loss, and uptime, into a low-dimensional latent embedding. Conditioning is achieved by augmenting the encoder and decoder inputs with a node-category indicator (e.g., server, laptop, IoT device, etc.), enabling a single shared model to represent heterogeneous devices within a unified embedding space.
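To make the conditioning step concrete, the sketch below shows only the data flow of a conditional autoencoder: the node-category indicator is appended to both the encoder and the decoder input. It uses single linear layers with random, untrained weights in pure Python, so the reconstruction is not meaningful. The latent size, the category list, and the function names are assumptions for this example, and the actual model architecture is not described at this level in the text.

```python
import random

CATEGORIES = ["server", "laptop", "iot"]  # example categories named in the text
STATE_DIM, LATENT_DIM = 7, 3              # 7 indicators; latent size is an assumption


def one_hot(category):
    return [1.0 if c == category else 0.0 for c in CATEGORIES]


def init_layer(n_in, n_out, rng):
    """Random weight matrix (n_out x n_in); a trained model would learn these."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]


def linear(weights, x):
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


rng = random.Random(0)
W_enc = init_layer(STATE_DIM + len(CATEGORIES), LATENT_DIM, rng)
W_dec = init_layer(LATENT_DIM + len(CATEGORIES), STATE_DIM, rng)


def encode(state, category):
    # Conditioning: the category indicator is appended to the encoder input.
    return linear(W_enc, state + one_hot(category))


def decode(latent, category):
    # The same indicator is appended to the decoder input.
    return linear(W_dec, latent + one_hot(category))


# Normalized CPU, memory, storage, bandwidth, latency, packet loss, uptime.
state = [0.6, 0.4, 0.7, 0.2, 0.1, 0.0, 0.99]
z = encode(state, "iot")
reconstruction = decode(z, "iot")
print(len(state), "->", len(z), "->", len(reconstruction))
```

Only the short vector `z` and the category label need to be transmitted to peers, which is where the communication saving comes from.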

To provide an initial indication of feasibility, we report illustrative results obtained from a synthetic heterogeneous setup including multiple node categories. The results show that accurate reconstruction of node states can be achieved using low-dimensional embeddings, while significantly reducing the amount of exchanged state information compared to full metric dissemination. More specifically, macro-accuracy is computed as the average reconstruction accuracy across various evaluation time steps. As shown in Figure 7, macro-accuracy increases with the embedding dimension, indicating improved reconstruction fidelity over the full evaluation horizon. These findings suggest that the proposed conditional autoencoder can effectively capture heterogeneous resource characteristics while maintaining communication efficiency. The results are not intended as a comprehensive evaluation. More extensive validation under larger-scale and more dynamic deployments is planned as future work.

Implementation artifacts and additional technical specifications are available in [41], which includes component source code and deployment configurations.

Kubernetes integration of the above modules is achieved by deploying each as a distinct DaemonSet, ensuring that they run consistently on every node and allowing for seamless scaling across heterogeneous infrastructures. The entire framework extends the in-place Kubernetes metrics system, adding new insights, without disrupting standard operations. The envisioned architecture is depicted in Figure 8, illustrating the placement of the aforementioned modules within each node and their coexistence with in-node Kubernetes components, such as kubelet and kube-proxy. Through this design, the platform achieves a decentralized foundation for resilient, scalable, and intelligent self-management throughout the computing continuum.

6.3. Swarm Formation

Critical applications require guaranteed resources and high reliability. By dynamically forming swarms of nodes (logical clusters within the physical cluster), each dedicated to an application’s requirements, the system aims to support reliable execution even under fluctuating workloads and heterogeneous infrastructure conditions. Forming a suitable logical cluster for each application’s execution requires global knowledge and coordination. In our approach, this is handled by selecting a leader node [42] among the worker nodes to be responsible for creating the logical cluster that matches the application’s needs.

Every time an application arrives at the Control Plane, the Logical Cluster Manager selects a leader node among the worker nodes through a rotating leader strategy, periodically designating different nodes to act as leaders, as long as they meet reliability criteria. The selected leader then initiates the logical cluster formation process by sending the application’s requirements, including CPU and memory, to the rest of the worker nodes in the system. Each node decides to offer to participate in the logical cluster based on the requirements received and its own computational resources.
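The leader rotation and the offer step can be sketched as follows. The round-robin order, the uptime-based reliability threshold, and the rule that a node offers only when it has free capacity for every requested resource are all assumptions for this example; the text does not specify these criteria.

```python
def next_leader(workers, last_index, is_reliable):
    """Round-robin over workers, skipping those that fail the reliability check."""
    n = len(workers)
    for step in range(1, n + 1):
        idx = (last_index + step) % n
        if is_reliable(workers[idx]):
            return idx
    raise RuntimeError("no reliable worker available")


def make_offer(node, requirements):
    """Offer free capacity only if the node has some for every requested resource."""
    free = node["free"]
    if all(free.get(r, 0) > 0 for r in requirements):
        return {"node": node["name"], "offer": {r: free[r] for r in requirements}}
    return None


def reliable(worker):
    return worker["uptime"] >= 0.95  # hypothetical reliability criterion


workers = [
    {"name": "w0", "uptime": 0.99, "free": {"cpu": 2, "mem": 4}},
    {"name": "w1", "uptime": 0.80, "free": {"cpu": 8, "mem": 16}},
    {"name": "w2", "uptime": 0.97, "free": {"cpu": 0, "mem": 8}},
]
requirements = {"cpu": 1, "mem": 1}

leader_idx = next_leader(workers, last_index=2, is_reliable=reliable)
offers = []
for i, worker in enumerate(workers):
    if i == leader_idx:
        continue
    offer = make_offer(worker, requirements)
    if offer is not None:
        offers.append(offer)

print(workers[leader_idx]["name"], [o["node"] for o in offers])
```

Note that the reliability check applies only to leader eligibility in this sketch: `w1` is skipped as a leader but may still offer its resources.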

After the leader node has received the responses from the worker nodes (with respect to a timeout limit), it formulates and solves a Mixed-Integer Linear Programming (MILP) problem [43] to determine the optimal set of nodes to form the logical cluster. This ensures that the selected group can meet the application’s demand while respecting the constraints related to the tasks comprising it, thereby avoiding unnecessary resource fragmentation and preserving capacity for future applications. When the logical cluster is finalized, the leader node informs the Control Plane about the outcome. While the application is running, the leader node continuously monitors for node failures and demand increases/decreases and partially reruns the MILP process to dynamically reshape the cluster without disrupting the application’s execution.

To make the swarm formation process explicit, we next formalize the leader’s decision-making step as a lightweight Mixed-Integer Linear Program (MILP) that selects a feasible and resource-efficient logical cluster based on the application profile and the offers received from candidate nodes.

MILP Formulation for Logical Cluster (Swarm) Formation

Swarm formation is performed by the elected leader node by solving a lightweight MILP that jointly determines (i) the set of nodes participating in the logical cluster and (ii) a feasible task-to-node assignment.

Let $N$ denote the set of candidate nodes and $T$ the set of tasks composing the application. Let $R$ be the set of policy-relevant resources (e.g., CPU and memory). Each node $i \in N$ advertises an available capacity $C_{i,r}$ for each resource $r \in R$, while each task $t \in T$ requires $d_{t,r}$ units of resource $r$.

Binary decision variables $x_{i}$ indicate whether node $i$ is selected into the swarm, and $y_{t,i}$ indicate whether task $t$ is assigned to node $i$. Each task must be assigned to exactly one selected node:

$$\sum_{i \in N} y_{t,i} = 1 \quad \forall t \in T, \qquad y_{t,i} \leq x_{i} \quad \forall t \in T,\ \forall i \in N. \tag{1}$$

Resource feasibility is enforced through per-node capacity constraints:

$$\sum_{t \in T} d_{t,r}\, y_{t,i} \leq C_{i,r} \quad \forall i \in N,\ \forall r \in R. \tag{2}$$

Overprovisioning is explicitly modeled at the swarm level. For each resource $r$, the excess capacity of the selected nodes over the aggregate application demand $D_{r} = \sum_{t \in T} d_{t,r}$ is captured by a non-negative slack variable $o_{r}$:

$$\sum_{i \in N} C_{i,r}\, x_{i} - D_{r} = o_{r}, \qquad o_{r} \geq 0 \quad \forall r \in R. \tag{3}$$

The objective minimizes a weighted combination of (i) the number of selected nodes, (ii) normalized overprovisioning across resources, and (iii) a utility-driven preference reflecting node-specific attributes encoded in a scalar utility score:

$$\min \; \gamma_{1} \sum_{i \in N} x_{i} + \gamma_{2} \sum_{r \in R} \frac{o_{r}}{D_{r}} - \gamma_{3} \sum_{i \in N} x_{i}\, U_{i}, \tag{4}$$

where $U_{i}$ denotes the utility score of node $i$ and $\gamma_{1}, \gamma_{2}, \gamma_{3}$ are configuration weights.

The solution defines a feasible logical cluster together with a valid task assignment. The same formulation can be re-solved incrementally to adapt the swarm to node failures or demand changes without disrupting application execution.
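For very small instances, constraints (1)-(3) and objective (4) can be checked by exhaustive enumeration, which is useful for sanity-testing a solver configuration. The sketch below is such a brute-force search, not a MILP solver and not the implementation used by the leader node. The function name, the input layout, and the example numbers are assumptions for this illustration, and it requires every aggregate demand to be positive.

```python
from itertools import product


def solve_swarm(capacity, demand, utility, gammas):
    """Exhaustively solve the swarm-selection model for a tiny instance.

    capacity[i][r]: capacity of node i for resource r (C_{i,r})
    demand[t][r]:   demand of task t for resource r   (d_{t,r})
    utility[i]:     utility score of node i           (U_i)
    Returns (objective, selected_nodes, assignment), or None if infeasible.
    """
    g1, g2, g3 = gammas
    nodes, tasks = list(capacity), list(demand)
    resources = list(next(iter(demand.values())))
    total = {r: sum(demand[t][r] for t in tasks) for r in resources}  # D_r
    best = None
    for x in product([0, 1], repeat=len(nodes)):
        selected = [n for n, bit in zip(nodes, x) if bit]
        if not selected:
            continue
        # Constraint (3): the slack o_r must be non-negative.
        slack = {r: sum(capacity[i][r] for i in selected) - total[r]
                 for r in resources}
        if any(s < 0 for s in slack.values()):
            continue
        # Constraint (1): each task goes to exactly one selected node.
        for choice in product(selected, repeat=len(tasks)):
            load = {i: {r: 0 for r in resources} for i in selected}
            for t, i in zip(tasks, choice):
                for r in resources:
                    load[i][r] += demand[t][r]
            # Constraint (2): per-node capacity.
            if any(load[i][r] > capacity[i][r]
                   for i in selected for r in resources):
                continue
            obj = (g1 * len(selected)
                   + g2 * sum(slack[r] / total[r] for r in resources)
                   - g3 * sum(utility[i] for i in selected))
            if best is None or obj < best[0]:
                best = (obj, selected, dict(zip(tasks, choice)))
            break  # objective (4) depends only on x, so one feasible y suffices
    return best


capacity = {"n1": {"cpu": 4, "mem": 8}, "n2": {"cpu": 2, "mem": 4},
            "n3": {"cpu": 8, "mem": 16}}
demand = {"t1": {"cpu": 2, "mem": 4}, "t2": {"cpu": 2, "mem": 2}}
utility = {"n1": 0.5, "n2": 0.5, "n3": 0.5}

result = solve_swarm(capacity, demand, utility, gammas=(1.0, 1.0, 0.1))
print(result)
```

In this instance the single node `n1` is chosen: it fits both tasks with the least spare capacity, whereas `n3` alone would also be feasible but incurs a larger overprovisioning penalty. The search space grows exponentially with the number of nodes and tasks, so this approach is only suitable for checking toy cases.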

To provide an initial indication of feasibility, we report illustrative results obtained from a synthetic swarm-formation setup with heterogeneous nodes and application requirements. The proposed MILP-based formulation is compared against a best-fit heuristic and a Kubernetes-like baseline to assess cluster compactness and decision overhead.

As shown in Figure 9, the proposed approach consistently forms smaller logical clusters, selecting significantly fewer nodes while still satisfying application resource constraints. This indicates that the formulation effectively avoids unnecessary over-allocation and produces tighter swarms compared to heuristic alternatives. Figure 10 reports the corresponding decision time, showing that swarm formation remains within sub-second latency for the evaluated configurations, which suggests that the optimization step is practical to deploy at the evaluated scale.

These results are not intended as a comprehensive performance evaluation; rather, they demonstrate the feasibility and scalability trends of the proposed swarm formation mechanism. More extensive experimentation under larger-scale and dynamic deployments is planned as future work.

Kubernetes integration of the above logic is accomplished by deploying Local Node Agents in each node as a DaemonSet and the Logical Cluster Manager as a custom controller [44] at the Control Plane. This framework seamlessly extends native Kubernetes resource management mechanisms, embedding advanced clustering and optimization capabilities without disrupting standard operations. The architecture is depicted in Figure 11, highlighting how the Logical Cluster Manager and Local Node Agents are placed within the Kubernetes environment alongside core components such as kubelet and kube-proxy. Through this design, the platform establishes a decentralized, yet coordinated, foundation for forming application-specific logical clusters, enabling resilient, adaptive, and efficient execution across the entire computing continuum.

Leader failure handling depends on the swarm lifecycle stage. If the elected leader becomes unavailable during swarm formation, the procedure is aborted and re-initiated by selecting a new eligible leader. If the leader fails after the logical cluster has been formed and the application is executing, the Control Plane appoints a new leader and provides it with the current logical cluster description and application profile, so that monitoring and subsequent reconfiguration decisions can continue without requiring full cluster reformation.

7. Distributed and Adaptive Resource Scheduling

This section introduces the Distributed and Adaptive Resource Optimization (DARO) framework, a Kubernetes-native scheduling system designed to support real-time resource management and task scheduling in large-scale heterogeneous environments [45]. DARO leverages a cooperative Multi-Agent Reinforcement Learning (MARL) approach [46] to overcome the limitations of centralized scheduling, such as scalability bottlenecks, single points of failure, and poor adaptability in dynamic workloads [47]. Unlike traditional centralized schedulers, DARO employs decentralized decision-making under partial observability, enabling it to operate effectively in resource-constrained, non-stationary, and cloud–edge–IoT environments [48]. Each agent learns to make local scheduling decisions while contributing to a globally efficient task allocation policy.

The choice of MARL is motivated by the inherently distributed and dynamic nature of cloud–edge–IoT infrastructures, where global system state is either unavailable or rapidly outdated. In such environments, centralized optimization becomes increasingly brittle as cluster size and heterogeneity grow. MARL enables scalable decision-making by decomposing the scheduling problem into locally solvable subproblems while still allowing agents to converge toward cooperative behavior through shared reward signals. This makes MARL particularly suitable for adaptive scheduling under partial observability, fluctuating resource availability, and decentralized control constraints.

7.1. Architectural Components

The primary objective of DARO is to perform resource allocation and task scheduling, considering various aspects, including the balance of resource utilization among distributed nodes and the proximity of tasks to the data they require (either from storage or other tasks). Its architecture follows an asynchronous Kubernetes-native pattern and consists of four building blocks: the Scheduler Broker, the Distributed Reinforcement Learning (RL) Agents, a Delayed Feedback Repository, and a set of Application Programming Interfaces (APIs). The DARO framework, including its core components and their interactions, is illustrated within the Kubernetes architecture in Figure 12.

The core of the system is the Scheduler Broker, which operates within the Kubernetes Control Plane and coordinates the scheduling process. It replaces the default scheduler and is responsible for managing task submissions, broadcasting task specifications to worker nodes, collecting responses from distributed agents, and finalizing scheduling decisions by interacting with the Kubernetes API server.

DARO adopts an auction-based interaction model between the Scheduler Broker and the distributed RL agents. This design choice provides a structured yet lightweight coordination mechanism that is well-suited for decentralized environments. Auctions enable each node to independently express its suitability for executing a task through a bid, based on locally observed resource conditions and learned policies, without requiring full global state knowledge.

Compared to monolithic optimization approaches, auction-based scheduling improves scalability and robustness by limiting communication overhead to concise bid exchanges and by avoiding tight coupling between decision-makers. In DARO, the auction mechanism serves as the coordination layer, reconciling decentralized agent decisions into a single, enforceable scheduling outcome that is compatible with Kubernetes’ control-plane semantics.

Each worker node hosts an autonomous Distributed Reinforcement Learning (RL) Agent that evaluates the incoming scheduling opportunities. These agents make independent decisions based on task information (e.g., resource requirements, data needs) as well as local observations, such as CPU and memory availability, current workload, data locality, and network conditions [49]. By learning over time, each agent improves its ability to bid for tasks in a way that balances resource efficiency, data transmission, and task performance.

In addition, DARO includes a Delayed Feedback Repository that stores structured execution records containing task outcomes (i.e., success or failure), resource consumption, execution latency, and periodic snapshots of the state of the system. This repository supports off-policy learning and delayed reward computation.
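As an illustration of what such a structured execution record might look like, the following minimal Python sketch models a record and an in-memory repository. The class names, field names, and units are hypothetical and chosen only to mirror the items listed above; they are not DARO's actual schema or storage backend.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class ExecutionRecord:
    """One structured execution record (field names are illustrative)."""
    task_id: str
    node: str
    succeeded: bool                 # task outcome: success or failure
    cpu_used: float                 # resource consumption, e.g., cores
    mem_used_mib: float             # resource consumption, e.g., MiB
    latency_s: float                # execution latency in seconds
    state_snapshot: Dict[str, float] = field(default_factory=dict)


class DelayedFeedbackRepository:
    """In-memory stand-in; a real deployment would persist the records."""

    def __init__(self) -> None:
        self._records: List[ExecutionRecord] = []

    def append(self, record: ExecutionRecord) -> None:
        self._records.append(record)

    def for_node(self, node: str) -> List[ExecutionRecord]:
        """Return every record for one node, e.g., to compute a delayed reward."""
        return [r for r in self._records if r.node == node]

    def success_rate(self, node: str) -> float:
        """Fraction of successful tasks on a node (0.0 if nothing was recorded)."""
        records = self.for_node(node)
        if not records:
            return 0.0
        return sum(r.succeeded for r in records) / len(records)
```

Because records are appended after execution and read later, a learner can replay them independently of the policy that produced them, which is the property off-policy learning relies on.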

Finally, a set of Application Programming Interfaces (APIs) facilitates communication between DARO components and the broader Kubernetes system. These APIs allow the broker to bind pods, agents to access resource metrics, and external modules to interact with the scheduler, ensuring tight integration with the Kubernetes ecosystem and other architectural layers in the framework.

7.2. Task Scheduling Workflow

DARO replaces the default Kubernetes scheduling workflow with a decentralized, MARL-driven alternative, as shown in Figure 13:

1. When a new pod is marked as unscheduled by the Kubernetes API server, the Scheduler Broker is notified and initiates the scheduling process.
2. The Scheduler Broker filters eligible nodes based on hard constraints and broadcasts task metadata to the available RL Agents.
3. Each RL agent computes a bid based on task information and local metrics (such as CPU, memory, and bandwidth) and sends it back to the Scheduler Broker.
4. The Broker receives and ranks the bids, selects the most suitable node, and binds the pod via the Kubernetes API.
5. The selected node’s kubelet retrieves the pod specification and launches the container.
6. Relevant data, including task parameters, bid values, system state, and execution results, are captured by the Delayed Feedback Repository.
7. The data from the Delayed Feedback Repository are used to compute reward signals that guide the agents’ learning and adaptation over time, thereby improving future scheduling decisions.
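Steps 2 to 4 of the workflow above can be sketched as a single auction round. The Python code below is a minimal illustration under stated assumptions: the types `Task` and `NodeState`, the placeholder policy `headroom_bid`, and the rule that the highest bid wins are all hypothetical. In DARO, each bid would come from the agent's learned policy, and binding would go through the Kubernetes API, which is omitted here.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class Task:
    name: str
    cpu_request: float   # cores
    mem_request: float   # MiB


@dataclass
class NodeState:
    name: str
    cpu_free: float      # cores currently unallocated
    mem_free: float      # MiB currently unallocated


# A bidding policy maps (task, locally observed node state) to a scalar bid.
BidPolicy = Callable[[Task, NodeState], float]


def headroom_bid(task: Task, node: NodeState) -> float:
    """Placeholder policy: bid the smaller fraction of free resources left
    after placement. A learned MARL policy would replace this function."""
    cpu_left = (node.cpu_free - task.cpu_request) / max(node.cpu_free, 1e-9)
    mem_left = (node.mem_free - task.mem_request) / max(node.mem_free, 1e-9)
    return min(cpu_left, mem_left)


def run_auction(task: Task, nodes: List[NodeState],
                policies: Dict[str, BidPolicy]) -> Optional[str]:
    """Run one auction round and return the winning node name, or None."""
    # Step 2: filter nodes on hard constraints (here: enough free CPU and memory).
    eligible = [n for n in nodes
                if n.cpu_free >= task.cpu_request and n.mem_free >= task.mem_request]
    if not eligible:
        return None
    # Step 3: each agent computes its bid from purely local information.
    bids = {n.name: policies[n.name](task, n) for n in eligible}
    # Step 4: rank the bids; this sketch assumes the highest bid wins.
    return max(bids, key=bids.get)
```

Only the task metadata and one scalar per node cross the network in this sketch, which reflects the limited communication overhead that motivates the auction design.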

7.3. Reward Function and Learning Objective

The reward function in DARO is designed to promote balanced and efficient resource utilization across the cluster while maintaining internally balanced nodes. The objective is to discourage both global load skewness and local resource fragmentation, encouraging cooperative behavior among scheduling agents. The reward is computed centrally by the Scheduler Broker after each scheduling decision and broadcast to all agents.

Formally, the reward is defined as:

$$R = \left(1 - \frac{\sigma_{cpu} + \sigma_{mem} + \sigma_{stg}}{3}\right) + \frac{1}{N} \sum_{i=1}^{N} \left(1 - \left| u_{i}^{cpu} - u_{i}^{mem} \right|\right) \tag{5}$$

where:

$\sigma_{cpu}$, $\sigma_{mem}$, and $\sigma_{stg}$ denote the standard deviations of the CPU, memory, and storage utilization ratios across all nodes;
$1 - |u_{i}^{cpu} - u_{i}^{mem}|$, the summand of the second term, is the internal balance score of node $i$, reflecting how evenly its CPU and memory resources are utilized;
$u_{i}^{cpu}$ and $u_{i}^{mem}$ are the resource utilization ratios of node $i$, computed as the fraction of allocated resources over total available capacity;
$N$ denotes the total number of nodes in the cluster.

The first term incentivizes uniform utilization across the cluster by penalizing high variance in per-node resource usage, thus mitigating hotspot formation and long-term imbalance. The second term promotes balanced consumption of CPU and memory within individual nodes, reducing fragmentation and improving future placement flexibility.
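To make the two terms concrete, the following short Python sketch evaluates Equation (5) for a list of per-node utilization ratios. It assumes the population standard deviation (`statistics.pstdev`), since the text does not state whether the population or sample form is used; the function name `daro_reward` is illustrative.

```python
from statistics import pstdev
from typing import Sequence


def daro_reward(cpu: Sequence[float], mem: Sequence[float],
                stg: Sequence[float]) -> float:
    """Evaluate Equation (5) from per-node utilization ratios in [0, 1].

    Assumption: standard deviations are population standard deviations.
    """
    n = len(cpu)
    if n == 0 or len(mem) != n or len(stg) != n:
        raise ValueError("expected one CPU, memory, and storage ratio per node")
    # First term: penalize uneven utilization across the cluster.
    spread = (pstdev(cpu) + pstdev(mem) + pstdev(stg)) / 3
    # Second term: average internal CPU/memory balance score over all nodes.
    balance = sum(1 - abs(c - m) for c, m in zip(cpu, mem)) / n
    return (1 - spread) + balance


# Two nodes with identical, internally balanced utilization reach the maximum of 2.
balanced = daro_reward([0.5, 0.5], [0.5, 0.5], [0.5, 0.5])
# A skewed cluster with CPU-heavy and memory-heavy nodes scores lower.
skewed = daro_reward([0.9, 0.1], [0.1, 0.9], [0.5, 0.5])
```

With utilization ratios in $[0, 1]$, each term is at most 1, so $R \le 2$, with equality when all nodes have identical utilization and each node's CPU and memory ratios coincide.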

This formulation relies exclusively on low-level resource signals that are directly observable by the scheduler and agents, making it suitable for decentralized learning and scalable deployment. While effective for infrastructure-level balancing, it does not yet incorporate higher-level objectives such as task latency, data locality, or network conditions. Extending the reward to capture such factors is left for future work, particularly in data-intensive and latency-sensitive cloud–edge deployments.

To provide an initial indication of the feasibility and learning behavior of DARO, we report preliminary results obtained from training distributed RL agents in simulated Kubernetes clusters [50] using the reward function defined in Equation (5). In this setup, up to 60 worker nodes were equipped with distributed MARL agents and trained over multiple scheduling episodes under stable workload conditions. Figure 14 illustrates the evolution of the mean episodic return, defined as the cumulative reward over an episode. The observed upward trend and subsequent stabilization indicate that the agents progressively learn scheduling policies that improve cluster-wide resource balance and node-level utilization over complete decision horizons. While these results are not intended as a comprehensive performance evaluation, they demonstrate the practical viability of the proposed learning formulation and confirm that the reward function provides a meaningful optimization signal. More extensive experiments, including comparisons against heuristic schedulers and larger heterogeneous deployments, are planned as future work.
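For readers unfamiliar with the metric, the sketch below shows one way to compute it in Python. The episodic return is the sum of per-step rewards, as defined above; averaging over a batch of episodes is an assumption made here, since the text does not state over what the mean is taken (it could equally be over agents or over a moving window).

```python
from statistics import mean
from typing import Sequence


def episodic_return(rewards: Sequence[float]) -> float:
    """Cumulative (undiscounted) reward collected over one episode."""
    return sum(rewards)


def mean_episodic_return(episodes: Sequence[Sequence[float]]) -> float:
    """Average episodic return over a batch of episodes (assumed definition)."""
    return mean(episodic_return(ep) for ep in episodes)
```

Tracking this value across training iterations yields a learning curve of the kind described for Figure 14.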

7.4. Integration with the Proposed Architecture

DARO is integrated into the broader proposed architecture, benefiting from the outputs of several supporting modules:

Higher-level abstractions (see Section 4) and mechanisms (see Section 5) provide semantic context about applications and resources to support informed and constraint-aware scheduling decisions.
Real-time system state estimation (see Section 6.1) improves local agent decision-making by providing accurate, up-to-date resource metrics.
Logical node clustering (see Section 6.3) informs how the nodes should coordinate and manage data locality in the infrastructure.

Together, these modules form a closed-loop control system that begins with system monitoring, proceeds through logical clustering, task placement, and resource allocation, and continuously adapts to changing workload patterns and infrastructure conditions.

8. Decentralized Trust and Security Framework

The described security framework is designed to implement a fully decentralized trust and security model by tightly integrating DLT [51] with the principles of Zero-Trust [27]. The architecture combines Hyperledger Fabric, the Extensible Access Control Markup Language (XACML) [52], distributed policy enforcement points, and an application-level encryption layer provided through EncryptFlow. Hyperledger Fabric, which operates as a permissioned blockchain, establishes the foundational trust layer by ensuring verifiable participant identities, immutable storage of access control policies, and decentralized credential management via Decentralized Identifiers (DIDs) [53]. Its modular architecture supports flexible deployments across multiple organizations and provides fault-tolerant consensus, thus reducing the exposure and trust limitations typically associated with public blockchains. The general structure and interaction of these components are illustrated in Figure 15, which provides an overview of the architecture.

In addition to this secure ledger foundation, XACML introduces a dynamic and context-sensitive attribute-based access control (ABAC) mechanism [54]. In contrast to static role-based models, ABAC evaluates access requests in real-time using multiple attributes, contextual parameters, and risk indicators [55]. Storing both policies and attributes directly on the blockchain ensures that access control decisions remain traceable, tamper-resistant, and consistently enforced across the system. This capability is realized through a coordinated network of specialized nodes: Hyperledger Fabric nodes manage decentralized identities and preserve immutable policies, XACML nodes serve as decision points for real-time authorization, and PEP-Proxy nodes act as distributed enforcement points that intercept requests and verify their validity before granting access.

To extend the Zero-Trust paradigm beyond access control and into data protection, the framework integrates EncryptFlow [56], a portable end-to-end encryption and decentralized key management solution. EncryptFlow ensures that sensitive data remains encrypted throughout its life cycle, from generation to authorized processing, while assigning full control of key management to the data owner. This approach protects confidentiality and integrity, aligning with the same decentralized verification-first principles used for identity and access management, thus creating a cohesive and unified trust model across all layers of the architecture.

The choice of a DLT-based trust layer is motivated by the limitations of conventional security mechanisms in federated cloud–edge–IoT environments. While PKI and mTLS-based approaches provide efficient point-to-point authentication, they typically assume centralized certificate authorities and relatively static trust relationships. These assumptions become increasingly difficult to sustain in multi-stakeholder and cross-domain deployments, where identities, policies, and administrative control are distributed across organizational boundaries. In contrast, a permissioned distributed ledger enables a shared and tamper-resistant trust substrate that maintains consistently verifiable identities and authorization policies across participating domains [57].

8.1. Implementation of DLT-Based Security

The practical implementation of this design involves deploying a DLT-based security architecture to provide secure identity management and access control in decentralized environments [58]. Hyperledger Fabric serves as the backbone for storing identity records, access policies, and attributes, offering a tamper-proof and auditable ledger. In parallel, XACML operates as the policy decision point, evaluating each authorization request against the policies and contextual data recorded on the blockchain. These authorization decisions are enforced in real-time by PEP-Proxy components, which act as distributed entry points within the architecture. Each PEP-Proxy intercepts incoming access requests, verifies the validity of the access token included in the request, and grants or denies access accordingly. Together, these components remove the dependency on centralized intermediaries and ensure operational alignment with Zero-Trust principles.

From a protocol perspective, authorization in the proposed architecture follows a clear separation between decision-making and enforcement responsibilities. XACML operates as the policy decision point (PDP), while distributed PEP-Proxy components enforce access decisions at the application boundary. Interactions with the ledger rely on Fabric’s standard transaction flow and are exposed to the security components through a REST API layer, avoiding the introduction of custom network security protocols and enabling a direct mapping between (i) identity and credential operations, (ii) policy and attribute management, and (iii) audit logging smart contracts.

To enable these capabilities, two dedicated smart contracts are deployed in Hyperledger Fabric. The first manages policies and attributes through an indexed JSON structure, allowing for registration, version-controlled updates, and retrieval by domain or type. This structure ensures that authorization always relies on the most current rules while preserving historical versions for audit and compliance purposes. The second smart contract provides immutable logging of authorization requests, supporting historical analysis and forensic investigation while preventing any modification of stored records. Both smart contracts are accessed via a REST API, which offers a standardized interface for XACML and other security components to interact with the ledger. This separation of policy logic from direct blockchain interaction improves scalability, simplifies integration, and facilitates interoperability between different components.
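The behavior of the first smart contract can be approximated by the following in-memory Python model. It is only a sketch of the registration, versioned update, and scoped retrieval semantics described above: the class `PolicyStore` and its method and field names are hypothetical, and actual Fabric chaincode would be written against the Fabric contract API rather than a Python dictionary.

```python
import copy
import time
from typing import Dict, List, Optional


class PolicyStore:
    """Append-only model of the policy/attribute contract (illustrative only)."""

    def __init__(self) -> None:
        # Identifier -> list of versions, oldest first; entries are never deleted.
        self._versions: Dict[str, List[dict]] = {}

    def _append(self, policy_id: str, domain: str, kind: str,
                body: dict, version: int) -> dict:
        entry = {"id": policy_id, "domain": domain, "type": kind,
                 "version": version, "timestamp": time.time(),
                 "body": copy.deepcopy(body)}
        self._versions.setdefault(policy_id, []).append(entry)
        return entry

    def register(self, policy_id: str, domain: str, kind: str, body: dict) -> dict:
        """Store version 1 under a new identifier; identifiers must be unique."""
        if policy_id in self._versions:
            raise ValueError(f"identifier already registered: {policy_id}")
        return self._append(policy_id, domain, kind, body, version=1)

    def update(self, policy_id: str, body: dict) -> dict:
        """Append a new version; the identifier stays constant."""
        latest = self.latest(policy_id)
        return self._append(policy_id, latest["domain"], latest["type"],
                            body, latest["version"] + 1)

    def latest(self, policy_id: str) -> dict:
        """Most current version, which authorization should rely on."""
        return self._versions[policy_id][-1]

    def history(self, policy_id: str) -> List[dict]:
        """All versions, preserved for audit and compliance purposes."""
        return list(self._versions[policy_id])

    def query(self, domain: Optional[str] = None,
              kind: Optional[str] = None) -> List[dict]:
        """Latest entries, optionally filtered by domain and/or type."""
        latest = (versions[-1] for versions in self._versions.values())
        return [e for e in latest
                if (domain is None or e["domain"] == domain)
                and (kind is None or e["type"] == kind)]
```

Keeping every version in an append-only list is what allows authorization to read the most current rules while audits can still inspect earlier ones.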

EncryptFlow functions as an additional cryptographic enforcement layer, providing encryption at the source, controlled decryption at authorized processing points, and decentralized key management under the control of the data owner. When sensitive data is generated, it is encrypted using key material provided by the owner’s key manager. Any entity requiring access must submit a formal decryption request, which is evaluated against the policies stored on the blockchain. Approved requests result in the delivery of wrapped decryption keys, while both successful and rejected requests are authenticated via DIDs and recorded in an immutable audit log. By combining immutable policy management, context-aware ABAC decision-making, and user-controlled encryption, the implementation ensures that access rights and data confidentiality are preserved within a transparent, verifiable, and decentralized framework.

To conclude this section, it is essential to clarify how the proposed decentralized trust layer integrates with Kubernetes’ native access control mechanisms. The DLT-based Zero-Trust framework does not replace Kubernetes’ built-in security features; instead, it complements them. Kubernetes continues to enforce authentication and authorization within the cluster, while the blockchain layer provides verifiable identity, immutable policy storage, and policy consistency across heterogeneous environments—capabilities that extend beyond traditional cluster-scoped enforcement.

Although this additional verification introduces an extra authorization step, the associated overhead is limited in practice, as policy evaluation and ledger interaction are primarily triggered during identity validation and access authorization events, rather than during every data-plane interaction or scheduling decision. As a result, the proposed approach avoids per-message verification costs and preserves the expected performance characteristics of Kubernetes control-plane operations (e.g., scheduler latency and API responsiveness), while enabling decentralized trust and secure interoperability across heterogeneous environments.

8.2. Operational Workflow of Zero-Trust Security

The process starts with the establishment of trusted identities for all participating nodes. This onboarding process assumes a permissioned Fabric network in which participating organizations are registered via their Membership Service Providers (MSPs) and issue cryptographic identities through Certificate Authorities (CAs). Under these assumptions, only authenticated participants can register DIDs, obtain verifiable credentials, and interact with protected resources, aligning the trust layer with Zero-Trust principles. Within each node, the component known as the Holder is responsible for generating and storing the Decentralized Identifier (DID) locally. This process begins with the creation of a cryptographic key pair using the ES256K algorithm, followed by the construction of a DID document that contains identity metadata and a verification method to retrieve the public key. Once generated, the DID is also registered in Hyperledger Fabric, ensuring both immutability and verifiable ownership. The diagram illustrating the DID creation and registration process is shown in Figure 16. In parallel, the node acquires a Verifiable Credential either at deployment or when first requesting access to a protected resource. This issuance involves a secure, multi-step interaction with an issuer, resulting in a signed credential that will be presented in future authorization requests.
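As a rough illustration of the DID document constructed by the Holder, the Python sketch below assembles a document for a secp256k1 public key, the curve used by ES256K. Several details are assumptions rather than the framework's actual choices: the placeholder `did:example` method, the derivation of the identifier from a hash of the key, and the `JsonWebKey2020` verification method type. Key generation itself is omitted, since it requires a secp256k1-capable library; the function takes the public key's affine coordinates as input.

```python
import base64
import hashlib
from typing import Dict


def _b64url(data: bytes) -> str:
    """Base64url without padding, as used in JSON Web Keys."""
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode()


def build_did_document(x: int, y: int, method: str = "example") -> Dict:
    """Build a DID document for a secp256k1 public key (x, y)."""
    x_bytes = x.to_bytes(32, "big")
    y_bytes = y.to_bytes(32, "big")
    # Assumption: the method-specific identifier is a truncated key hash.
    fingerprint = hashlib.sha256(x_bytes + y_bytes).hexdigest()[:32]
    did = f"did:{method}:{fingerprint}"
    key_id = f"{did}#key-1"
    return {
        "@context": [
            "https://www.w3.org/ns/did/v1",
            "https://w3id.org/security/suites/jws-2020/v1",
        ],
        "id": did,
        # Verification method through which the public key can be retrieved.
        "verificationMethod": [{
            "id": key_id,
            "type": "JsonWebKey2020",
            "controller": did,
            "publicKeyJwk": {
                "kty": "EC",
                "crv": "secp256k1",
                "x": _b64url(x_bytes),
                "y": _b64url(y_bytes),
            },
        }],
        "authentication": [key_id],
    }
```

The resulting dictionary can be serialized to JSON and stored locally by the Holder before the DID is registered on the ledger.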

Once identities and credentials are in place, the management of access policies and attributes follows a consistent blockchain-backed process. XACML submits a JSON policy or attribute set via the REST API, using the identifier as the blockchain key to guarantee uniqueness and immutability. Updates follow a version-controlled model in which the timestamp and version number are incremented while the identifier remains constant, thereby preserving a full change history. Retrieval can be scoped to specific domains or performed globally, ensuring that enforcement components always operate with the latest relevant policy data. Authorization itself proceeds in a structured manner. When a resource request is made, the verifier sends the resource, subject, and action parameters to XACML, which compares them with the policies stored in the ledger. If the policy decision is “Permit”, the verifier generates a signed access token. This token is then validated by the PEP-Proxy before granting access, ensuring that each request is explicitly verified and that any unauthorized attempt is blocked at the entry point.
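The authorization sequence described above (policy decision, token issuance on “Permit”, and validation at the enforcement point) can be sketched in Python as follows. This is a simplified model, not the framework's implementation: the policy matching is a toy stand-in for XACML evaluation, the token format is invented for the example, and an HMAC over a shared demo secret replaces whatever signature scheme the verifier actually uses, which the text does not specify.

```python
import base64
import hashlib
import hmac
import json
import time
from typing import Dict, List, Optional

# Hypothetical demo secret; a real verifier would more plausibly sign
# with an asymmetric key bound to its identity.
DEMO_SECRET = b"demo-only-secret"


def pdp_decide(policies: List[Dict], subject: str, resource: str, action: str) -> str:
    """Toy policy decision: Permit if any policy matches all three parameters."""
    for policy in policies:
        if (policy["subject"] == subject and policy["resource"] == resource
                and action in policy["actions"]):
            return "Permit"
    return "Deny"


def _sign(payload: bytes) -> str:
    return hmac.new(DEMO_SECRET, payload, hashlib.sha256).hexdigest()


def issue_token(subject: str, resource: str, action: str,
                ttl_s: float = 60.0, now: Optional[float] = None) -> str:
    """Verifier side: issue a signed, short-lived access token."""
    issued = time.time() if now is None else now
    claims = {"sub": subject, "res": resource, "act": action, "exp": issued + ttl_s}
    payload = json.dumps(claims, sort_keys=True).encode()
    return base64.urlsafe_b64encode(payload).decode() + "." + _sign(payload)


def pep_validate(token: str, resource: str, action: str,
                 now: Optional[float] = None) -> bool:
    """PEP side: check signature, scope, and expiry before granting access."""
    current = time.time() if now is None else now
    try:
        encoded, signature = token.split(".")
        payload = base64.urlsafe_b64decode(encoded.encode())
    except ValueError:
        return False
    if not hmac.compare_digest(signature.encode(), _sign(payload).encode()):
        return False
    claims = json.loads(payload)
    return (claims["res"] == resource and claims["act"] == action
            and claims["exp"] > current)


def authorize(policies: List[Dict], subject: str, resource: str,
              action: str) -> Optional[str]:
    """Return an access token if the decision is Permit, otherwise None."""
    if pdp_decide(policies, subject, resource, action) == "Permit":
        return issue_token(subject, resource, action)
    return None
```

Separating `pdp_decide` from `pep_validate` mirrors the split between decision and enforcement: the enforcement point never evaluates policies itself and only checks that a valid token covers the requested resource and action.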

Data protection workflows integrate seamlessly with this process through EncryptFlow. A task that produces sensitive data encrypts it with keys obtained from the owner’s key manager, after which the encrypted data can be stored or transmitted without risk of exposure. If another task needs to process this data, it creates a share request embedded within the encrypted content and sends it to the key manager. The request is evaluated according to the access policies, and if approved, the requester receives the wrapped decryption keys. All exchanges are logged in detail, creating an immutable audit trail that supports compliance verification and dispute resolution. Through this tightly integrated sequence of steps, our proposed architecture maintains a decentralized, verifiable, and context-sensitive security process that protects both the identity and data layers, ensuring integrity, confidentiality, and trust across the entire platform.

Failure handling within the DLT-based trust layer builds on the resilience mechanisms provided by the underlying permissioned blockchain. Identity records, access control policies, and audit information are maintained across multiple ledger nodes, allowing the system to tolerate individual node outages without immediate disruption to security operations. In the event that a ledger peer becomes temporarily unavailable, other participating peers can continue processing requests, and the failed node may rejoin the network once connectivity is restored.

Ordering service availability is supported through replicated orderer components so that transaction processing can continue as long as sufficient ordering nodes remain active. From the perspective of security enforcement components (e.g., XACML or PEP-Proxy), temporary ledger unavailability may delay the propagation of policy updates, while access decisions continue to rely on the most recently committed policy state until normal ledger operation resumes. This approach helps maintain consistent authorization behavior during partial DLT outages [59].
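The fallback behavior of an enforcement component during a ledger outage can be illustrated with a small Python sketch. The names `CachedPolicyView` and `LedgerUnavailable` are hypothetical, and the `fetch` callable stands in for a call to the ledger's REST API; the sketch shows only the idea of serving the most recently committed policy state while the ledger is unreachable.

```python
from typing import Callable, Optional


class LedgerUnavailable(Exception):
    """Raised by the fetch callable when no ledger peer can be reached."""


class CachedPolicyView:
    """Serve the last committed policy state during temporary ledger outages."""

    def __init__(self, fetch: Callable[[], dict]) -> None:
        self._fetch = fetch                  # stand-in for a ledger REST call
        self._cached: Optional[dict] = None
        self.stale = False                   # True while serving cached state

    def current(self) -> dict:
        try:
            self._cached = self._fetch()
            self.stale = False
        except LedgerUnavailable:
            if self._cached is None:
                # Nothing was ever committed locally, so no safe fallback exists.
                raise
            self.stale = True
        return self._cached
```

Exposing the `stale` flag lets a caller distinguish fresh from cached decisions, for instance to log that a decision was made on possibly outdated policy data.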

9. Discussion and Future Work

This section offers additional reflections on the architecture proposed in this article. Although numerous cloud–edge–IoT orchestration systems have emerged over the past decade, including KubeEdge, Open Horizon, and hybrid cloud frameworks, these solutions typically address isolated parts of the continuum stack. Existing systems provide device connectivity, basic offloading, or edge-aware scheduling, but do not offer a unified approach that spans heterogeneous device abstraction, decentralized monitoring, autonomous logical cluster formation, learning-enabled scheduling, and Zero-Trust security. The proposed architecture rests on the premise that genuine autonomy in the computing continuum can only emerge from the integration of these capabilities into a single coherent design. This convergence distinguishes it from prior approaches that have focused narrowly on scheduling strategies, IoT integration, or Kubernetes edge extensions, without addressing the full end-to-end orchestration pipeline.

It is worth explaining why this architecture can support a broader range of complex and multi-modal applications and workloads. The architecture incorporates heterogeneous computational resources, including diverse edge devices (e.g., Android and Linux-based devices), which extend the reach of the platform but introduce non-trivial design and implementation challenges. In general, container orchestration engines are used to set up cloud services and manage computational and networking resources. They automate the deployment of applications, provide scaling mechanisms, and offer tools and interfaces for managing any containerized application. Moreover, load balancing between services can be automated or enforced, and additional networking policies can readily be applied to each cloud service. However, while decentralization and learning-based decision-making improve adaptability and resilience, they also introduce trade-offs in terms of communication overhead, energy consumption on constrained devices, and convergence time, all of which are considerably influenced by deployment scale and workload dynamics.

9.1. Autonomy and Adaptability

Our proposed architecture aims to unify diverse and multi-modal devices under a Kubernetes-native orchestration ecosystem. As discussed previously, this is only partially achieved by other existing orchestration systems, such as KubeEdge or OpenShift. Our proposed architecture incorporates additional tools that either extend the native capabilities of Kubernetes (e.g., Hypertool, Open Connectors) or adopt the Kubernetes principles but are redesigned to add more features and technologies (e.g., DARO scheduler, self-monitoring, DLT trust mechanisms). The latter tools are not natively provided by Kubernetes but operate by adopting the same principles of orchestration, scalability, compatibility between multi-modal nodes, and security.

The proposed architecture is capable of self-adapting and self-managing all workloads and resources in various dynamic large-scale environments. In Section 2, we briefly presented many existing orchestration platforms, including Kubernetes, which rely heavily on predefined scaling and resource utilization policies. These mechanisms are not efficient when applied to highly heterogeneous infrastructures, which contain multi-modal nodes with computing incompatibilities, especially when energy and networking limitations are present. Our proposed architecture was initially designed to address all these challenges by creating abstraction layers and embedding intelligence in each layer. The proposed architecture offers a decentralized self-awareness mechanism (i.e., Monitoring and Swarm Formation) that, in conjunction with the DARO scheduler, facilitates achieving autonomy and resilience across heterogeneous environments.

The Decentralized Self-Awareness mechanisms provide tools for every node to monitor itself, detect any anomalies that occur, and exchange its state with other nodes. The entire cluster is capable of maintaining accurate self-situational awareness without relying on a central monitoring service (e.g., Prometheus). Using AI-driven anomaly detection and autoencoder-based full-state estimation, nodes can manage performance issues, detect bottlenecks, and adapt accordingly. Communication overhead is minimized, and all correcting actions are performed in real time.

Another layer is the Swarm Formation mechanism, which enables nodes to dynamically form logical clusters that adapt to specific application requirements. These swarms provide sub-clusters to users and can reconfigure themselves in response to any failure or change in resource availability. This adaptive clustering improves resource aggregation and reliability under variable workloads.

The Distributed and Adaptive Resource Optimization (DARO) framework further reinforces adaptivity at the scheduling layer. DARO is a distributed, multi-agent reinforcement learning (MARL) approach for scheduling applications, which enables nodes to learn over time and improve their scheduling policies based on resource and other requirements. Although each agent acts and learns independently, it contributes to a globally efficient scheduling policy. Over time, the system is expected to converge towards task placement strategies that improve resource balance and, once the reward is extended as discussed in Section 7.3, also account for latency and data locality. The interaction between self-awareness, swarm formation, and MARL scheduling creates a closed-loop adaptive system capable of continuous self-optimization.

All of the above layers provide a very strong and robust foundation for autonomy and adaptability towards deployed workloads. However, maintaining the stability and interoperability of such autonomous decision-making processes is always challenging and remains a direction for future research.

9.2. Scalability, Interoperability and Security Considerations

The proposed architecture aims to enhance scalability across heterogeneous devices (mentioned in a previous section as Device Nodes). Through the Resource and Abstraction layer in our architectural hierarchy, we provide a unified representation of cloud, edge, and IoT resources, enabling uniform orchestration across edge device classes that are otherwise incompatible. The introduction of Open Connectors further extends interoperability, allowing non-container-native devices (e.g., Android OS-based devices) to operate as Kubernetes cluster nodes using Custom Resource Definitions (CRDs). This significantly broadens the applicability of containerized orchestration to diverse ecosystems, including wearables, AR devices, industrial IoT sensors, and other edge devices.

The above requires additional security mechanisms to be enforced, which can be accomplished with the Distributed Ledger Technology (DLT) framework. This framework enforces identity management, access control, and auditability in a tamper-proof and Zero-Trust manner. Confidentiality and traceability are improved, despite some computational and storage overhead being introduced.

9.3. Limitations

Despite its promising potential, the proposed architecture also introduces several design and implementation challenges. First, embedding distributed intelligence in every node increases the system’s complexity and may lead to higher energy consumption on constrained devices. Second, the MARL-based scheduling framework, while scalable in principle, requires careful tuning to prevent excessive network communication and ensure convergence under dynamic conditions. Third, integration of DLT mechanisms may cause latency overhead in time-sensitive operations, especially when dealing with large-scale geographically distributed deployments. Finally, interoperability testing across diverse network configurations, varying communication protocols, and different operating systems represents a non-trivial engineering effort that requires systematic validation.

9.4. Future Work

The architecture presented provides a foundation for a new generation of self-managing and intelligent orchestration frameworks that can efficiently bridge between heterogeneous and distributed infrastructures. It is by integrating decentralized intelligence, adaptive scheduling, and verifiable trust that we lay the foundation for an autonomous computing continuum, where nodes and applications participate in a dynamic collaboration to negotiate resources and adapt to contextual changes in real-time. This model opens up the path towards cognitive orchestration systems, i.e., platforms that not only deploy and scale applications but also constantly monitor performance, energy efficiency, or data locality to guarantee optimal execution without intervention from a human operator. In doing so, our proposed architecture provides not only a technical roadmap for distributed orchestration using AI but also a systematic model for self-evolving computing environments that can sustain promising paradigms, such as AI-driven networking and federated intelligence, with regard to context-adaptive services across the full computing continuum.

Future research will focus on evaluating and validating our proposed architecture in real-world scenarios. We aim to deploy our orchestration platform in hybrid cloud infrastructures that also contain edge and IoT environments and devices. Current implementations have been deployed within the scope of a European-funded research project named HYPER-AI [60]. All of the above layers, along with additional components, are currently under development, and a first version of them has been tested and open-sourced [61]. The project’s goal is to provide a unified container orchestration platform with all the proposed features. Specific use cases have been agreed upon to test and stress this new orchestration platform in several real-world settings and industries, including farming and agriculture, mobility and automotive, green energy, healthcare, and Industry 4.0.

10. Conclusions

To facilitate the smooth orchestration and execution of workloads across heterogeneous and distributed environments, this paper presents a unified Kubernetes-native architecture. By extending the conventional cloud orchestration paradigm, the proposed framework creates a distributed, self-governing, and trust-aware ecosystem that can dynamically adjust to changing infrastructure conditions and diverse device capabilities. The proposed architecture provides an integrated solution that bridges the gap between cloud-native and edge-native computing paradigms through abstraction layers and components, including the Resource and Application Abstraction Layer, Open Connectors, Decentralized Self-Awareness and Swarm Formation mechanisms, the Distributed and Adaptive Resource Optimization (DARO) scheduler, and the DLT-based Zero-Trust Security framework.

The architecture distinguishes itself by integrating intelligence and autonomy at multiple levels, so that nodes can self-monitor, cooperate, and collectively optimize resource allocation without centralized control. This design improves responsiveness, scalability, and fault tolerance in settings with varying computational resources, limited connectivity, and volatility. A verifiable trust foundation is also provided by the integration of distributed ledger technologies and attribute-based access control, which supports the decentralized and tamper-resistant maintenance of data integrity, identity management, and authorization.

The proposed architecture creates a robust framework for overseeing the next generation of complex, AI-driven, and multimodal applications through the integration of adaptability, decentralization, and trust. The proposed architecture offers a collection of innovative mechanisms and a unified design philosophy that transforms the perception and orchestration of computing resources across boundaries. Future efforts will encompass experimental validation and performance benchmarking in extensive testbeds, the enhancement of learning-based scheduling algorithms, and the investigation of lightweight blockchain architectures to optimize security and efficiency.

With this work, we aim to establish the foundation for autonomous, scalable, and reliable orchestration in hyper-distributed environments, as a step toward a genuinely intelligent computing continuum.

Author Contributions

Conceptualization, V.P., S.H., H.H., K.I., I.T.M., R.M.P., A.M. and J.C.; methodology, N.D., V.P., M.L., M.T., S.M.U.H., H.H., T.M., E.B., F.J.R.M., A.M., J.C. and P.S.; software, N.D., V.P., M.L., M.T., S.M.U.H., H.H., T.M., E.B., F.J.R.M., A.M., J.C. and P.S.; validation, all authors; formal analysis, all authors; investigation, all authors; resources, all authors; data curation, all authors; writing—original draft preparation, N.D., V.P., M.L., M.T., S.M.U.H., H.H., T.M., E.B., F.J.R.M., A.M., J.C. and P.S.; writing—review and editing, all authors; visualization, all authors; supervision, N.D., H.H. and T.M.; project administration, I.T.M. and E.K.; funding acquisition, V.P., S.H., H.H., K.I., I.T.M., R.M.P., A.M. and J.C. All authors have read and agreed to the published version of the manuscript.

Funding

This work has been supported by the HYPER-AI project, funded by the European Commission under Grant Agreement 101135982 through the Horizon Europe research and innovation program (https://hyper-ai-project.eu/, accessed on 10 November 2025).

Data Availability Statement

Data sharing is not applicable to this article as no new datasets were generated or analyzed in this study.

Conflicts of Interest

Author Amr Mousa was employed by Virtual Vehicle Research GmbH. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.


Figure 1. Layered architecture of the DACCA framework.

Figure 2. Data model for architecture cluster nodes.

Figure 3. Native Nodes and Device Nodes extended data model.

Figure 4. Application Profile data model.

Figure 5. Open Connectors architecture and components.

Figure 6. C Connectors sequence diagram.

Figure 7. Macro-accuracy of reconstructed node states for different embedding dimensions (k = 3, 4, 5), averaged over all evaluation timesteps in a synthetic heterogeneous setup.

Figure 8. Self-Awareness and Full-State Estimation in Kubernetes architecture.

Figure 9. Number of selected nodes per logical cluster for the proposed MILP-based swarm formation compared to heuristic and Kubernetes-like baselines under synthetic workloads.

Figure 10. Swarm formation decision time for the proposed MILP-based approach under increasing numbers of candidate nodes in a synthetic setup.

Figure 11. Logical cluster formation in Kubernetes architecture.

Figure 12. DARO components within the Kubernetes architecture.

Figure 13. DARO-integrated scheduling workflow in Kubernetes.

Figure 14. Preliminary DARO training results illustrating the evolution of the mean episodic return (cumulative reward) during MARL training in a small-scale Kubernetes cluster.

Figure 15. Architecture for decentralized data trust and security based on DLT.

Figure 16. DID generation and registration process.

Table 1. Structured comparison of DACCA with representative orchestration platforms.

System	Cloud	Edge	IoT/Non-Container	Decentralized Control	Adaptive Scheduling	Zero-Trust Security	Intended Scope
Kubernetes	Yes	Limited	No	No	Rule-based	No	Cloud-native container orchestration
Amazon EKS	Yes	Limited	No	No	Rule-based	Partial	Managed cloud Kubernetes
Azure AKS	Yes	Limited	No	No	Rule-based	Partial	Managed cloud Kubernetes
Nomad	Yes	No	No	No	Rule-based	No	Lightweight cluster scheduling
OpenStack	Yes	Limited	No	No	Rule-based	Partial	Cloud infrastructure management
OpenShift	Yes	Limited	No	No	Rule-based	Partial	Enterprise DevOps and CI/CD
KubeEdge	Yes	Yes	Limited	No	Rule-based	Partial	Cloud-to-edge container orchestration
Docker Swarm	Yes	No	No	No	Rule-based	No	Simple container clustering
DACCA (proposed)	Yes	Yes	Yes	Yes	Learning-based	Yes	Autonomous cloud–edge–IoT continuum

Table 2. Node conditions and their meanings.

Type	Status	Meaning
Ready	True	Node is healthy and accepting pods.
MemoryPressure	False	Node has sufficient memory.
DiskPressure	False	Node has enough disk space.
PIDPressure	False	Node has sufficient process IDs available.
NetworkUnavailable	False	Node has network connectivity.
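
As an illustrative sketch (not part of the DACCA implementation), the healthy-node statuses listed in Table 2 can be checked programmatically. The example assumes that conditions are supplied as a list of dictionaries with `type` and `status` keys, mirroring the shape of the `status.conditions` field of a Kubernetes Node object; the function names and the sample nodes are hypothetical and serve only to show the check.

```python
# Expected status of each condition type on a healthy node (values from Table 2).
HEALTHY_STATUS = {
    "Ready": "True",
    "MemoryPressure": "False",
    "DiskPressure": "False",
    "PIDPressure": "False",
    "NetworkUnavailable": "False",
}


def unhealthy_conditions(conditions):
    """Return the condition types whose status deviates from Table 2.

    `conditions` is a list of dicts with "type" and "status" keys.
    Assumption of this sketch: a missing "Ready" condition counts as a
    deviation, while a missing pressure/network condition is ignored.
    Any status other than the expected one (including "Unknown") deviates.
    """
    reported = {c["type"]: c["status"] for c in conditions}
    problems = []
    for cond_type, expected in HEALTHY_STATUS.items():
        actual = reported.get(cond_type)
        if actual is None:
            if cond_type == "Ready":
                problems.append(cond_type)
            continue
        if actual != expected:
            problems.append(cond_type)
    return problems


def is_healthy(conditions):
    """True when no condition deviates from the healthy statuses."""
    return not unhealthy_conditions(conditions)


# Hypothetical sample data for illustration only.
healthy_node = [
    {"type": "Ready", "status": "True"},
    {"type": "MemoryPressure", "status": "False"},
    {"type": "DiskPressure", "status": "False"},
    {"type": "PIDPressure", "status": "False"},
    {"type": "NetworkUnavailable", "status": "False"},
]
degraded_node = [
    {"type": "Ready", "status": "True"},
    {"type": "MemoryPressure", "status": "True"},
    {"type": "DiskPressure", "status": "Unknown"},
]

if __name__ == "__main__":
    print(is_healthy(healthy_node))
    print(unhealthy_conditions(degraded_node))
```

Returning the list of deviating condition types, rather than a single Boolean, keeps the reason for a node's exclusion visible to whichever component consumes the result.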

Table 3. Node addresses and examples.

Type	Example
InternalIP	192.168.1.10
ExternalIP	35.184.202.50
Hostname	worker-node-1

