1. Introduction
Edge computing allows processing of data closer to its source, aiming to reduce latency, conserve bandwidth and ensure data sovereignty, as (sensitive) information is not transferred to remote systems [1,2,3]. Small, low-powered edge devices often operate in ensembles, such as swarms of robots, drones or groups of IoT devices. In such settings, distributing computational tasks across multiple nodes rather than relying on individual units offers several advantages. First, in an edge environment, tasks may exceed the capacity of a single device, require special hardware or benefit from less frequent context switches, since frequent switching can degrade energy efficiency and performance [4,5]. Second, it allows the integration of specialized hardware for executing specific sub-tasks (e.g., accelerator chips for certain portions of a neural network inference). Third, distributed systems provide considerable flexibility and scalability; adding or removing nodes is often more cost-effective than upgrading a single system to meet increasing demands. Lastly, a distributed approach facilitates data throughput by enabling pipelining, i.e., streamlining processing stages to ensure continuous data flow. One use case where all of the aforementioned points are particularly relevant is modern machine learning applications. Typically, deep neural network architectures, but also classical machine learning algorithms, comprise a vast number of computational parameters. Tiny machine learning systems often handle high volumes of individually small inference elements, such as low-resolution images or time series of sensor data. By distributing the computation across multiple devices [6], as sketched in Figure 1, hardware limitations can be alleviated by partitioning the task into steps that align with the capabilities of each device, taking advantage of specialized hardware where possible. Moreover, given that inference elements are individually small compared to the size of the neural network, efficient pipelining can be accomplished.
In dynamic edge networks with mobile, volatile participants, such as, for instance, swarms of autonomous robots or drones, reliable cooperative task processing becomes challenging. Connectivity instabilities caused by technical or environmental factors, as well as nodes failing due to hardware damage, power loss or (temporary) disconnection from the network due to poor positioning, make the network topology unpredictable. Wireless connectivity can also be disrupted by signal interference or network congestion.
Specific applications for AI-driven data analysis directly on dynamic edge networks can be envisioned, such as simultaneous localization and mapping (SLAM) on a fleet of autonomous mobile robots (AMRs) [7,8,9,10], or navigation for autonomous swarms of drones, in particular the determination of points of interest during flight time beyond the line of sight [11,12,13,14,15,16]. Other use cases include local preprocessing of data on wireless sensor networks (WSNs) in massive IoT applications, such as agriculture monitoring [17,18,19], and disaster relief and emergency operations, where temporary ad-hoc communication networks are installed when traditional infrastructure is compromised [20,21,22,23,24,25]. Finally, industrial IoT [26], smart city [27] or home automation can be further relevant fields for this technology.
Applications within dynamic edge networks, where nodes work cooperatively, must address the challenge of how to maintain reliable, distributed task execution when the set of available devices, their connectivity, and their computational capabilities are constantly shifting. Unlike static clusters, these networks lack stable routing paths and persistent neighbors, meaning that any assignment of computation must be resilient to intermittent links or unreachable nodes. The heterogeneity of hardware further complicates coordination, as nodes differ not only in compute power and in the types of operations they can efficiently support, but–especially in the machine learning regime–also in their ability to handle the substantial memory demands of parameter-heavy models. Consequently, a distributed computing framework is required which ensures that multi-stage machine learning pipelines progress automatically and without interruption and which features mechanisms that can reassess resources continuously, hence being able to react to failures instantly and reconfigure task flows on the fly.
In this manuscript, we present DATOR (Distributed Automatic Task ORchestration), a conceptual algorithmic framework capable of processing sensor data directly within a volatile cluster of heterogeneous edge devices. In particular, inference tasks can be processed in multiple sequential stages in a fully distributed and self-organized manner, without the need for a central orchestration unit and hence without a single point of failure. Broadly speaking, the collective ensemble of multiple edge devices ensures that the system as a whole can continue to operate even when one or more nodes fail or leave the network, or when connections are disrupted, by appropriately rerouting tasks. Conversely, the system is designed to automatically adapt when new nodes are added, seamlessly integrating them into the orchestration process. The machine learning inference task is split into distinct, sequential computation steps, which are allocated dynamically across various nodes within the ensemble. Our solution is generic and applicable to any network comprising multiple nodes engaged in cooperative tasks. However, its lightweight and fully distributed design renders it particularly suitable for low-power mobile edge networks.
2. Related Work
Edge networks that rely on small embedded devices face significant constraints in processing power, memory, storage and battery capacity. These devices are typically designed for simple, specific tasks, limiting the complexity of applications they can execute and the amount of data they can process. This study investigates a connected cluster composed entirely of resource-constrained devices, defining an extreme edge computing environment. This “extreme” context is characterized by, first, the absence of any reliable nodes with considerable performance above the level of low-power microcontroller units, and second, the dominance of mobile, volatile devices, thus demanding a decentralized approach. To the best of our knowledge, no existing orchestration framework explicitly targets these edge networks.
Research in the related domain of mobility-aware edge computing [28,29,30] typically relies on trajectory prediction, usage patterns, or geometric models to proactively migrate services and cache data across nearby devices. These approaches generally assume the presence of a stable backbone, such as edge servers or cloudlets, to orchestrate the handover. In contrast, we consider scenarios whose dynamics are non-deterministic and volatile rather than following predictable patterns. Moreover, our architecture is strictly decentralized; by design, we cannot rely on a hierarchical controller or a stable super-node to maintain a consistent global view of the network state. Consequently, our problem statement requires a reactive, lightweight mechanism capable of functioning locally.
Also, there are conceptual similarities with fog orchestration. In the literature [31,32,33,34], the term fog colony is used to refer to an ensemble of computational resources confined to a given geographical region [35,36]. Fog nodes are typically located closer to the data source than cloud services and encompass a wide range of devices, including dedicated servers, routers, smartphones or edge devices. Fog computing aligns well with modern IoT applications such as smart cities and industrial IoT, where at least some computationally capable and stationary hardware is available to handle processing and coordination tasks. Many fog orchestration approaches adopt a centralized or hierarchical architecture. In a centralized setup, a dedicated orchestrator unit manages resource allocation and service deployment. In contrast, hierarchical models distribute orchestration responsibilities across multiple layers, where higher-tier nodes coordinate lower-tier ones. The more centralized an architecture, the greater the risk of single points of failure and scalability bottlenecks, even as resource management becomes potentially easier. Hierarchical models improve fault tolerance by partially distributing the orchestration responsibilities but still depend on higher-tier nodes, which can create bottlenecks.
Table 1 summarizes the key differences between existing approaches and ours. Our framework eliminates the need for a central controller, as it distributes the decision-making process across all nodes. This distinction is critical in edge environments, where unpredictable network topology changes can render a centralized orchestrator unreliable. Some decentralized approaches introduce area orchestrators responsible for local resource management, with role rotation mechanisms to enhance robustness [37]. Our solution, in contrast, orchestrates computation payloads across all nodes without predefined roles or hierarchical coordination, avoiding dependencies on any kind of designated orchestrator unit or role.
Moreover, virtualization and containerization are widely used in fog computing to abstract resources and enable flexible deployment [38]. These approaches are generally impractical for tiny embedded devices due to severe resource constraints. Naturally, hypervisors and container runtimes, which require a full software stack (kernel, operating system, networking) and specific hardware virtualization support such as Intel VT-x, are infeasible in these scenarios. Additionally, limited bandwidth further complicates deployment, as containers are often too large to be efficiently transferred or updated. As a result, edge orchestration must rely on lightweight deployment strategies, such as bare-metal execution, where the neural network is optimized and pre-deployed directly on the embedded device, and where these optimizations enable efficient execution with minimal resource consumption. In particular, DATOR is tailored to orchestrate already pre-segmented inference steps of machine learning architectures. Parameters (typically weights and biases) are selectively and statically loaded from persistent storage into fast memory. To minimize the overhead of frequent context switches and to improve pipeline efficiency, these parameters should be retained in fast memory for the duration of the relevant task or function.
The orchestration concept presented in this work is designed to be paired with a generic network backend allowing for stable communication within a group of devices, enabling peer-to-peer transmission as well as message routing. The decentralized nature of our approach makes it particularly well-suited for application in mobile mesh networks, in particular Mobile Ad-hoc Networks (MANETs) [39,40], which are self-configuring networks of mobile devices that communicate directly without relying on fixed infrastructure or dedicated master nodes. While the framework allows for the implementation of significant fault tolerance (such as acknowledgments and dynamic rescheduling) at the package level, additional fault tolerance in the backend can further enhance stability, especially for routed communication over several hops in volatile environments.
MANETs are used in applications like disaster recovery, environmental monitoring, sensor networks or vehicular networks. Notable examples are Meshmerize, a mesh networking solution designed for industrial applications such as automated warehouses and robotics, providing high-throughput and low-latency communication in environments with frequent obstacles and changing conditions [41], and BATMAN, a decentralized mesh networking protocol used for large-scale community-driven networks, where nodes proactively exchange routing information to adapt quickly to network changes, ensuring continuous connectivity in urban environments [42]. Moreover, Meshtastic is an open-source, low-power wireless communication platform that enables ad-hoc mesh networking over long distances using LoRa transceivers, though its LPWAN-based design inherently limits data throughput and message size. Additionally, recent work has proposed an ad-hoc network concept for low-energy Bluetooth edge devices [43], featuring a particularly lightweight, table-driven routing mechanism incorporated into standard Bluetooth advertisement messages. For the remainder of this article, we assume the existence of a suitable mesh network backend, enabling stable inter-node communication when connectivity is available, without specifying a particular implementation.
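To make this assumption concrete, the sketch below outlines the minimal communication interface our orchestration layer would expect from such a backend. The class and method names (MeshBackend, send, broadcast, on_receive) are our own illustration and do not refer to any specific product or API.

```python
from typing import Callable, Protocol


class MeshBackend(Protocol):
    """Minimal communication interface assumed from the mesh backend.

    Any MANET stack providing unicast routing and local broadcast
    could, in principle, implement it.
    """

    def send(self, node_id: str, payload: bytes) -> None:
        """Route a message to a specific node, possibly over multiple hops."""
        ...

    def broadcast(self, payload: bytes) -> None:
        """Deliver a message to all currently reachable nodes."""
        ...

    def on_receive(self, handler: Callable[[str, bytes], None]) -> None:
        """Register a callback invoked as handler(sender_id, payload)."""
        ...
```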
Our approach is particularly well-suited for sequential computations, a characteristic inherent to many deep neural network architectures. This allows us to adopt horizontal layer splitting, a technique that distributes consecutive layers across multiple devices to enable efficient pipelined execution. As shown in Figure 1, one device receives the input data, processes Layers 1 to j, and communicates the resulting data to device 2, which continues processing from layer j + 1 up to the next split point, and so on. There is no uniform term for this technique in the literature; however, terms like Horizontal Partitioning [44], Layerwise Splitting [45], Layer Pipelining [46] or Sequential Layer Mapping [47] describe the same principle. An excellent survey article is provided by Reference [6].
A key advantage of this approach is the immediate reduction of memory requirements per device. Since each device only holds a subset of the neural network parameters, the weights can reside in the local cache. This minimizes the need for frequent reloading, thereby reducing energy consumption and latency, since permanent storage access is slow. Because different layers in a deep learning system have varying numbers of parameters, this method is naturally suited for distribution across heterogeneous hardware [48]. Finally, the implementation and deployment overhead of layer-split models is low, since we only deploy the lightweight, inference-ready model in the production phase. Training is done, as usual, on stronger hardware and before deployment, without any additional overhead.
The identification of optimal split points is a standard multi-objective optimization problem and has been discussed in the literature (compare [6] and references therein). We define an inference step as the sum of computation performed by a group of layers between two split points. The final inference result is available once a data package has traversed all steps in the correct sequential order. Note that a large number of split points poses no significant overhead, as consecutive steps can be loaded onto the same device, effectively treating them as a single, larger sequence of layers. Finally, we note that the layer splitting approach is obviously limited for neural networks with non-sequential architectures, such as multi-modal inputs, branching or long-distance skip connections. While our concept could, in principle, be extended to handle arbitrary directed acyclic graph (DAG) architectures, i.e., execute parallel branches, this is left for future work.
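To illustrate the nature of the split-point problem, the following sketch implements one deliberately simple heuristic: grouping consecutive layers into steps so that each step's parameter footprint stays within a given memory budget. This single-objective greedy pass is only an example; realistic split-point optimization would additionally weigh activation sizes at the cut and per-device compute capabilities (compare [6]).

```python
def greedy_split(layer_param_bytes: list[int], budget: int) -> list[list[int]]:
    """Group consecutive layer indices into inference steps such that the
    summed parameter size per step stays within `budget` bytes."""
    steps: list[list[int]] = []
    current: list[int] = []
    used = 0
    for idx, size in enumerate(layer_param_bytes):
        if size > budget:
            raise ValueError(f"layer {idx} alone exceeds the memory budget")
        if used + size > budget and current:
            steps.append(current)        # close the current step at a split point
            current, used = [], 0
        current.append(idx)
        used += size
    if current:
        steps.append(current)
    return steps


# Example: a VGG-like profile with heavy early and late layers (sizes in MiB).
print(greedy_split([8, 6, 4, 4, 2, 2, 12], budget=16))
# -> [[0, 1], [2, 3, 4, 5], [6]]
```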
3. Materials and Methods
A core characteristic of our concept is that each device within an ensemble of network nodes operates independently. In other words, the orchestration system is fully distributed and does not rely on a central unit. In a cluster, all nodes operate on the same functional core, irrespective of their hardware capabilities. Relevant node roles (e.g., sensor node, inference worker, routing unit, or cloud/satellite gateway) are pre-defined but not statically assigned; instead, they are dynamically assumed at runtime. Hardware limitations may restrict certain specialized roles (e.g., camera input, satellite uplink, neural accelerator) to nodes possessing the necessary equipment. The adaptive role adjustment is enabled by an event-driven architecture, in which operations are always initiated in response to incoming trigger messages, as sketched in Figure 2. Decisions are made locally on the respective device. This guarantees a lightweight execution with negligible algorithmic and memory overhead. It also requires messages to be formalized in a way that they can be processed effectively at the relevant decision points. For this purpose, packages carry information about the currently required step of the computation as well as the intended final target of the inference result.
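A minimal sketch of what such self-describing packages could look like is given below; the field names are illustrative choices, not a fixed wire format of the framework.

```python
from dataclasses import dataclass


@dataclass
class WorkPackage:
    """Self-describing unit of work; field names are illustrative."""
    package_id: str      # unique ID, also usable for duplicate filtering
    next_step: int       # index of the inference step required next
    total_steps: int     # number of steps the pipeline comprises in total
    target_node: str     # final destination for the inference result
    payload: bytes       # sensor data or intermediate activations

    def is_complete(self) -> bool:
        return self.next_step >= self.total_steps
```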
At its core, we follow the principles of the Contract Net Protocol (CNP) [49,50,51], employing an auction-based broadcast-response mechanism for decentralized task orchestration. Whenever a cluster participant identifies a new data package in its local queue, it issues periodic request messages describing the required inference step for that specific payload. These requests are broadcast to all reachable nodes within the network, independent of whether the payload originates from a sensor input or from the output of a prior computation stage. Upon receiving such a request, each node evaluates its local capabilities and current workload. If the corresponding computation module is available and the node is idle or within acceptable utilization bounds, it replies with an acceptance message. Once the original requester collects one or more confirmations, it dispatches the data payload to the selected responder(s) for execution. Beyond this basic request-accept scheme, the framework allows for the incorporation of more elaborate bidding strategies, such as weighted auctions, to account for heterogeneous processing capabilities, network latency, and energy constraints across the swarm.
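A condensed, requester-side sketch of this request-accept cycle is shown below, building on the hypothetical MeshBackend and WorkPackage sketches above; the message encoding, the timeout value, and the first-accept policy are simplifying assumptions.

```python
import json
import time


def auction_round(backend, package, own_id, wait_s=0.2):
    """Broadcast a request for the package's next step and dispatch the
    payload to the first node that accepts. Returns the worker's ID,
    or None if nobody accepted (the auction is then simply retried)."""
    accepts: list[str] = []

    def handle(sender: str, raw: bytes) -> None:
        msg = json.loads(raw)
        if msg.get("type") == "accept" and msg.get("package_id") == package.package_id:
            accepts.append(sender)

    backend.on_receive(handle)
    request = {"type": "request", "package_id": package.package_id,
               "step": package.next_step, "origin": own_id}
    backend.broadcast(json.dumps(request).encode())
    time.sleep(wait_s)                      # collect acceptance messages

    if not accepts:
        return None                         # re-issue the request periodically
    winner = accepts[0]                     # simplest policy: first accept wins
    backend.send(winner, package.payload)   # dispatch payload for execution
    return winner
```

Replacing the first-accept policy with a scoring function over reported load, hop count, or battery level yields the weighted-auction variant mentioned above.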
The recipient node immediately begins its part of the inference process, executing consecutive steps until the next required step is not loaded locally. The package is then moved to the output queue, and the node initiates another auction round to find other network participants that can process this intermediate result by performing the next required inference step. This process continues until all steps have been successfully executed and the result has been sent to the final target, which is encapsulated within the package metadata.
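On the worker side, the corresponding behavior reduces to a few lines: execute locally loaded steps back to back, then queue the intermediate result for a new auction round. Here, loaded_steps and run_step are hypothetical stand-ins for the locally deployed inference modules.

```python
def process_package(pkg, loaded_steps, run_step, output_queue):
    """Run consecutive inference steps while they are locally available;
    afterwards, queue the package so the next auction round (or final
    result delivery) can be initiated."""
    while not pkg.is_complete() and pkg.next_step in loaded_steps:
        pkg.payload = run_step(loaded_steps[pkg.next_step], pkg.payload)
        pkg.next_step += 1
    output_queue.append(pkg)
```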
We emphasize that an ensemble of such nodes can operate with an arbitrary and dynamically changing distribution of inference steps. Packages automatically find their path “through the network”, ensuring that the steps are executed in the correct order and that the result is eventually sent to the intended target. It is clear that if the target node (e.g., equipped with cloud uplink) permanently leaves the system, proper operation is compromised. To address this possibility, one may employ backup target nodes or utilize persistent storage for inference results. Temporary target node outages present no issues due to the system’s ability to seamlessly reintegrate the node upon return.
Since there is no functional difference between receiving a data packet from another node as the result of a previous computation and receiving it directly from an input device, any node can be fed with suitably shaped data at any time. Incoming data is formally encapsulated by a work package manager (compare Figure 2), which provides queuing functionality for the sending and receiving process and can filter duplicates. It can also be used for monitoring of work packages, or for issuing retransmissions if packages appear to have been lost (indicated by, e.g., time limits or failed acknowledgments).
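A work package manager along these lines might be sketched as follows; the deduplication set and the retransmission timeout are illustrative design choices rather than prescribed parameters.

```python
import time
from collections import deque


class WorkPackageManager:
    """Queues outgoing/incoming packages, filters duplicates and
    schedules retransmissions for unacknowledged packages."""

    def __init__(self, retransmit_after_s: float = 2.0):
        self.inbox: deque = deque()
        self.outbox: deque = deque()
        self.seen_ids: set[str] = set()       # duplicate filter
        self.pending: dict[str, tuple] = {}   # package_id -> (package, sent_at)
        self.retransmit_after_s = retransmit_after_s

    def receive(self, pkg) -> bool:
        """Accept a package unless it was already seen."""
        if pkg.package_id in self.seen_ids:
            return False                      # duplicate, drop silently
        self.seen_ids.add(pkg.package_id)
        self.inbox.append(pkg)
        return True

    def mark_sent(self, pkg) -> None:
        self.pending[pkg.package_id] = (pkg, time.monotonic())

    def acknowledge(self, package_id: str) -> None:
        self.pending.pop(package_id, None)

    def due_for_retransmission(self):
        """Packages whose acknowledgment timed out (assumed lost)."""
        now = time.monotonic()
        return [pkg for pkg, sent in self.pending.values()
                if now - sent > self.retransmit_after_s]
```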
4. Results
The following simulations provide an initial proof of concept that demonstrates the viability of the proposed approach, whereas a full experimental validation will be addressed in future work. In order to demonstrate how the system achieves pipelining, with nodes working concurrently, we simulate a cluster of five devices in Figure 3. The principal network topology is depicted in the lower right of the figure. For clarity, all orchestration messages (requests, responses, acknowledgments) have been omitted in the figure, which hence shows only actual payload data transfers and computation events. We simulate a scenario where the data pipeline is segmented into six steps, starting from feed (purple), i.e., gathering and preprocessing sensor data, followed by four sequential inference blocks (orange to red) and eventually an upload of the result (grey). Colored boxes in front of specific rows indicate which steps of this task are currently loaded on the respective device. Only one of the six steps can be executed at a time, while sending and receiving are, in the examples presented here, handled concurrently. Specifically, Node 1 is equipped with both a sensor and an upload gateway, but cannot perform any part of the actual inference computation. Five work packages (A, B, C, …) are sequentially introduced into the system. If we follow package D, entering the system on Node 1 (purple event), we find that it is first processed by Node 5, as the closer Node 2 is currently busy. Even though consecutive execution of three steps on the same device might be efficient, the framework is not inherently aware of such optimizations. Instead, package routing decisions are made locally, in this case based solely on availability and the order of accept arrivals. After the first three inference steps are completed, the package is routed through Nodes 3 and 1 before reaching Node 4, which handles the final computation step and sends the package back to Node 1 for result handling. Note that redundant allocation of certain computation steps increases the degree of parallel execution. In this example, only Step 4 is available on a single node (Node 4), possibly creating a bottleneck.
In another set of simulations we evaluate the distributed inference performance depending on the rate of incoming inference elements (e.g., the frequency of images captured by a camera connected to Node 1). We employ the same network topology and step configuration as above (compare Figure 3). Node 1 is continuously fed with inference packages, separated by defined time intervals; the inverse of this interval constitutes the incoming package rate. The durations of the individual steps are modeled so as to mimic a generic CNN-based vision or object detection architecture [52]. In particular, the step computation durations are chosen such that the most computational load (in terms of FLOPs) resides on the first few convolution layers, which typically present a large number of channels, as well as on the classification section at the end, which often comprises multiple fully connected layers. The transmission durations of intermediate results between the steps are modeled analogously, reflecting a generic VGG-like architecture with a larger number of activations in earlier layers and progressively fewer towards the classifier. Sensor data acquisition and preprocessing (performed only by Node 1) are modeled with a separate duration. The transmission after preprocessing to inference step 1, as well as from step 4 to the target node, is considered small, since only little data is transferred (a low-resolution image and a few bytes encoding the result, respectively). Since orchestration messages are designed to be particularly lightweight and are efficiently handled by the framework's message handling architecture discussed in Figure 2, their transmission times are also significantly shorter than those of all other processes. In summary, orchestration messaging is by far the fastest process, followed by the lightweight input and result transmissions, while the transmission of intermediate results and the step computations dominate the overall timescale. Note that transmission times are given for each link in the path (i.e., per hop). Moreover, all timings have been randomized in order to capture environmental influences present in a real-world setting more realistically.
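To make the timing model concrete, the snippet below shows how such randomized durations can be configured in a simulation; the numerical values are illustrative placeholders and do not reproduce the exact parameters of our runs.

```python
import random

# Illustrative base durations in seconds (placeholders, not the paper's values):
# heavier early convolution steps and classifier, lighter middle steps.
STEP_COMPUTE_S = {1: 1.2, 2: 0.6, 3: 0.5, 4: 1.0}   # inference steps 1..4
FEED_S = 0.3                                         # sensor acquisition + preprocessing
TRANSFER_S = {"feed->1": 0.05, "1->2": 0.4, "2->3": 0.3,
              "3->4": 0.2, "4->target": 0.05}        # per hop
ORCH_MSG_S = 0.005                                   # request/accept/ack messages


def jitter(base: float, spread: float = 0.2) -> float:
    """Randomize a base duration by +/- spread to mimic environmental influences."""
    return base * random.uniform(1.0 - spread, 1.0 + spread)


print(jitter(STEP_COMPUTE_S[1]))  # e.g., 1.13
```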
We evaluate system performance parameters under four distinct network scenarios, detailed in Table 2. We begin by analyzing the stable scenario, represented by the blue curves in Figure 4 and Figure 5. This baseline scenario simulates a network with no failures. At low incoming packet rates, the average packet latency (the time from packet submission to result handling) remains constant. In this regime, packets are processed sequentially without queuing or mutual interference. Consequently, throughput increases approximately linearly with the incoming rate.
Figure 5e shows that Node 5 is initially inactive. This is because Node 2, which also performs the first inference step, is located closer to Node 1 (fewer hops). However, as the system becomes more congested, Node 5 begins processing packets and quickly assumes a share of the workload. Average system latency remains low until the incoming packet rate exceeds the combined processing capacity of Nodes 2 and 5 for inference step 1. At this point, Node 4 also becomes a bottleneck, as can be seen in Figure 5d. Further increases in the packet rate lead to a traffic congestion regime, where both latency and throughput saturate, indicating the maximum achievable performance for this configuration.
In the unstable scenario (orange curves), the unreliability of Node 2 leads to immediate involvement of Node 5 in packet processing. The average latency begins to increase at lower packet rates compared to the stable scenario. This is because the intermittent unavailability of Node 2 not only reduces the efficiency of inference step 1 but also disrupts routing paths temporarily. The unstable system eventually reaches saturation; however, it does so at a higher latency and lower throughput than the stable configuration.
The 5% message loss scenario (green curves) exhibits similar characteristics. The baseline latency is elevated due to the delays introduced by message loss and the subsequent rescheduling of orchestration messages or payloads. Given that a four-step inference process requires five successful transmissions (including preprocessing and result gathering), there is a probability of 1 − (1 − p)^n for failure somewhere in a chain of n consecutive messages, which is approximately 22.6% for p = 0.05 and n = 5. Similar to the unstable scenario, Node 5 becomes active already at small packet rates. However, the average idle time is lower than in the unstable scenario. This difference likely stems from the fact that the unstable scenario reduces available processing options due to node failures, whereas in the loss scenario, devices can continue their regular operations, even if some messages are lost.
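The failure probability quoted above follows directly from the per-message loss rate; a one-line check:

```python
p, n = 0.05, 5                    # per-message loss rate, messages per chain
print(1 - (1 - p) ** n)           # 0.22621906... -> approximately 22.6%
```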
Finally, we consider the slow network scenario (red curves). As expected, the baseline latency for purely sequential computation is higher than in the fast stable case, which can be directly attributed to the increased transmission latency. Node 5 becomes active even at very low packet rates and shows a significant further increase in activity as the system approaches congestion. The slow network configuration reaches the traffic congestion regime earlier and exhibits a higher average latency and lower throughput, similar to the fast but unstable or lossy scenarios. The average idle time is notably lower compared to all other scenarios. This is particularly evident for Node 1, which increases its activity from approximately 10–30% in the fast scenarios to about 60% in the slow network. This increased workload is attributed to the fact that Node 1 is not only preprocessing and distributing initial packets but also serves as a central routing hub for communication between Nodes 5 or 4 and Node 3. A similar increase in activity is observed for Nodes 2 and 3. Only the “end nodes”, 4 and 5, maintain similar activity levels compared to the fast scenarios.
5. Discussion
We present a concept for distributed machine learning inference in a mobile ad-hoc network of low-powered edge devices. By deploying an inference task as pre-defined steps, our approach becomes particularly lightweight, as only the neuron activations at the split points, along with minimal header information, are transmitted between computations. Appropriate split points can minimize activation sizes and hence transmission loads. For optimal pipelining performance and energy efficiency, we recommend partitioning the network such that the neural network weights and biases associated with each inference step fit within the cache of the specific edge devices used in the application, thereby mitigating the overhead of frequent context switching. Our approach can be well complemented by established neural network compression methods, such as pruning or quantization.
It is evident that even in the context of only a few devices, volatile ad-hoc networks exhibit a large number of degrees of freedom. There can be temporary activation and deactivation of nodes, dynamic addition and removal of nodes or connections, dynamic allocation of computational tasks, transmission interference leading to spontaneous message loss, or injection of data packages anywhere and anytime. These factors create combinatorially vast possibilities for the realization of task paths in the orchestration process. The self-organized design principle, which is capable of handling these dynamic conditions at runtime and routing packages accordingly, is hence not merely a convenience but a necessity.
A cluster of DATOR devices is not restricted to inference of a single machine learning task but can handle an entire procedure of tasks. Consider a wildlife research scenario with lightweight mobile nodes, e.g., drones or tagged animals [53,54]. Particular nodes, equipped with a camera, periodically capture images. An object detection architecture (Task 1) identifies animals in these images. Upon detection, a second neural network (e.g., a Super-Resolution Generative Adversarial Network, SRGAN) enhances details in the corresponding region of interest (Task 2). The feed-forward architecture of the generator part (convolution layers, upsampling layers, residual blocks) makes it suitable for integration within our sequential step design. If the researcher is, for instance, interested in specific birds, a third classification task (Task 3, possibly several steps) can be employed. The recognized species is then communicated via satellite (Task 4, single step).
Figure 6 illustrates the conceptual process diagram for this multistage distributed on-edge AI classification system using our framework.
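One way to express such a procedure is a declarative task chain, where each task carries its own step count and an optional trigger condition. The schema below is a hypothetical illustration for this scenario (step counts and trigger names are our assumptions), not a fixed configuration format of the framework.

```python
# Hypothetical declarative description of the wildlife scenario (compare Figure 6).
WILDLIFE_PIPELINE = [
    {"task": "object_detection",      "steps": 4, "trigger": None},                # Task 1
    {"task": "srgan_enhancement",     "steps": 3, "trigger": "animal_detected"},   # Task 2
    {"task": "species_classification","steps": 2, "trigger": "roi_enhanced"},      # Task 3
    {"task": "satellite_upload",      "steps": 1, "trigger": "bird_of_interest"},  # Task 4
]
```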
The present work provides a proof of concept, successfully showcasing the desired functionality in simulations. It should be viewed as a starting point for future research, with several avenues for extension and refinement. Currently, inference steps are statically assigned and manually adjusted via trigger messages. Simple heuristics, such as a device automatically assuming a step after repeated rejections by others, or an initiating node delegating tasks, are currently being implemented. A more advanced approach, currently under development, involves adaptive and self-organized task reassignment based on current resource availability and hardware constraints. Furthermore, incorporating a short-term memory mechanism to prioritize stable connections and paths over unstable ones is under investigation.
Our systematic evaluations have thus far been limited to small- to medium-sized networks (up to approximately ten nodes). Much larger meshes are feasible, and in principle, their performance would benefit from the redundancy provided by the distributed computational steps. However, a concern for scaling involves the management of communication overhead, as network-wide request messages can quickly lead to channel saturation. Therefore, appropriate lightweight strategies to mitigate this effect are currently under development.
Also, we aim to further enhance fault tolerance and load balancing. More elaborate acknowledgment mechanisms, well-suited to our lightweight messaging pattern, are planned to further improve robustness against disruptions. While the current framework inherently balances loads (occupied resources become unavailable for other computations), we envision further refinements. Including weights or minimal hardware information (e.g., battery levels, perhaps already provided by the network backend) within orchestration messages will enable prioritized task path decisions and hence a more nuanced load management.
Most importantly, we are in the process of deploying the system on actual hardware in a real use case. This step will enable a more quantitative empirical evaluation, including measurements of energy consumption, latency, communication overhead, and system behavior under realistic environmental volatility. Such validation will be essential to assess practical performance and to guide possible improvements of the orchestration concept. Ultimately, within an ongoing project, we plan to deploy the system in an existing use case in the field by equipping animal tags designed for AI-driven behavioral analysis with this technology, as part of the GAIA biodiversity protection project located in the Etosha National Park in Namibia [53,54].