Article

DistMLLM: Enhancing Multimodal Large Language Model Serving in Heterogeneous Edge Computing

Xingyu Yuan, Hui Chen, Lei Liu and He Li

1 Department of Sciences and Informatics, Muroran Institute of Technology, Muroran 050-8585, Hokkaido, Japan
2 Guangzhou Institute of Technology, Xidian University, Guangzhou 510555, China
* Author to whom correspondence should be addressed.
Sensors 2025, 25(24), 7612; https://doi.org/10.3390/s25247612
Submission received: 5 November 2025 / Revised: 9 December 2025 / Accepted: 12 December 2025 / Published: 15 December 2025
(This article belongs to the Special Issue Edge Computing for Beyond 5G and Wireless Sensor Networks)

Abstract

Multimodal Large Language Models (MLLMs) offer powerful capabilities for processing and generating text, image, and audio data, enabling real-time intelligence in diverse applications. Deploying MLLM services at the edge can reduce transmission latency and enhance responsiveness, but it also introduces significant challenges due to the high computational demands of these models and the heterogeneity of edge devices. In this paper, we propose DistMLLM, a profit-oriented framework that enables efficient MLLM service deployment in heterogeneous edge environments. DistMLLM disaggregates multimodal tasks into encoding and inference stages, assigning them to different devices based on capability. To optimize task allocation under uncertain device conditions and competing provider interests, it employs a multi-agent bandit algorithm that jointly learns and schedules encoder and inference tasks. Extensive simulations demonstrate that DistMLLM consistently achieves higher long-term profit and lower regret than strong baselines, offering a scalable and adaptive solution for edge-based MLLM services.

1. Introduction

Large Language Models (LLMs) have emerged as powerful tools capable of understanding and generating human-like responses across a wide range of tasks. Notable examples include OpenAI’s GPT-3 [1], Google’s PaLM [2], and Meta’s LLaMA [3]. While these models have demonstrated remarkable performance on text-based tasks, their capabilities remain largely confined to textual inputs, limiting their effectiveness in scenarios that require processing diverse modalities. To address this limitation, Multimodal Large Language Models (MLLMs) have been developed to integrate and interpret multiple forms of data, such as text, images, and audio. This multimodal capability enables MLLMs to handle complex real-world tasks with a level of sophistication that surpasses traditional LLMs. Recent advancements have driven their adoption in diverse domains such as healthcare [4], autonomous driving [5], and customer service applications, where real-time understanding and response to multimodal inputs are critical.
As the demand for low-latency, intelligent services increases, traditional cloud-based solutions face growing limitations. Centralized architectures often incur high latency, bandwidth constraints, and privacy concerns, which can impair the responsiveness of MLLM-powered systems. Deploying MLLMs at the network edge offers a promising alternative [6,7]. However, edge deployment presents significant challenges, primarily due to the heavy computational demands of these models, which often exceed the capabilities of typical edge devices [8,9]. Figure 1a shows the inference performance of three LLaMA models on three representative edge devices. Device specifications are provided in Table 1. Among the tested hardware, the RTX 4090 (NVIDIA, Santa Clara, CA, USA) delivered the highest performance across all model sizes, while the Jetson Orin Nano (NVIDIA, Santa Clara, CA, USA) was unable to complete inference for the 13B and 70B models, underscoring its limited capacity for large-scale deployment. Beyond model inference, multimodal applications also rely on efficient data encoding. To evaluate this aspect, we benchmarked the processing times of three representative encoders—FLAVA (facebook/flava-full), ViT (google/vit-base-patch16-224), and CLIP (openai/clip-vit-base-patch32) [10]—on the same edge devices, as shown in Figure 1b. The RTX 4090 achieved the fastest performance, while the Jetson AGX Orin (NVIDIA, Santa Clara, CA, USA) and Orin Nano also completed encoding tasks within 1 s, indicating that lightweight encoding is feasible even on resource-constrained platforms.
These observations highlight a broader challenge in edge deployment: device heterogeneity. Edge environments typically consist of devices with varying computational capacities, memory sizes, and communication bandwidths. Several works have explored methods to improve performance in heterogeneous edge environments. EdgeFlow [11] introduces a progressive partitioning strategy for deep neural networks, distributing model layers across multiple edge devices to reduce inference latency. This DAG-based framework enables fine-grained model decomposition, achieving significant acceleration in distributed inference. In parallel, Liu et al. [12] propose a joint optimization framework that integrates task offloading with resource allocation, aiming to minimize system latency under dynamic edge conditions. Their approach models system constraints explicitly and leverages algorithmic coordination to improve responsiveness across devices. While these methods enhance performance in specific edge inference or offloading settings, they mainly target single-task pipelines and do not address the coordination of interdependent subtasks within multimodal workloads. Moreover, they overlook the economic objectives of real-world service providers. In practice, providers must balance task latency, resource utilization, and operational profit when scheduling MLLM workloads across heterogeneous edge devices. This makes profit-oriented scheduling not only desirable but essential for sustainable edge deployment.
Recent efforts have explored deploying LLMs in edge or edge–cloud collaborative systems to enhance responsiveness and resource efficiency. He et al. [13] propose an active inference-based offloading mechanism that improves the adaptability of LLM deployment without relying on explicit reward signals. By replacing traditional DRL objectives with task-specific inference guidance, their method achieves better generalization under varying workloads. In a more comprehensive setting, Huang et al. [14] design a two-timescale optimization framework for MLLM deployment and scheduling in edge–cloud environments. Their approach combines hierarchical reinforcement learning with attention and memory modules to jointly optimize model placement, GPU provisioning, and resource allocation across space and time. Both works demonstrate the potential of learning-based strategies for managing LLM workloads, but they rely on centralized architectures or domain-specific assumptions, and do not explicitly address the challenges of disaggregated MLLM execution under multi-provider, profit-driven constraints.
To address these challenges, we propose DistMLLM, a novel framework designed to optimize the performance of MLLM services in heterogeneous edge environments. As illustrated in Figure 2, service providers first disaggregate multimodal language tasks and transmit raw multimodal data to less powerful edge devices, which are better suited for handling data-intensive operations. These devices perform multimodal encoding using lightweight GPUs and return the encoded tensor representations to the providers. The compact tensors are then forwarded to more powerful edge nodes for LLM inference. Finally, the inference results are sent back to the providers. This disaggregated execution strategy exploits the complementary strengths of heterogeneous devices while mitigating their individual limitations.
Efficient task allocation in such environments remains challenging due to unpredictable device loads and the conflicting interests of multiple service providers [15]. Multi-agent bandit algorithms have been widely used for decentralized task dispatch in uncertain and dynamic environments. For example, Chen et al. [16] propose C r o w d 2 , a bandit-based mechanism for assigning video analytics tasks in crowdsourcing platforms, which improves social welfare while maintaining fairness and sub-linear regret. Inspired by such approaches, DistMLLM incorporates a multi-agent bandit mechanism that operates across two layers: (i) an online learning layer that continuously estimates the utility of executing encoding and inference tasks on heterogeneous workers, and (ii) an allocation layer that performs coordinated task assignment based on these learned utilities. This design enables dynamic adaptation to real-time performance fluctuations and competing demands, ultimately maximizing provider profits while meeting task delay constraints.
Simulation results show that DistMLLM achieves the highest long-term profit and the lowest cumulative regret for both LLM and encoder tasks, consistently outperforming heuristic and UCB-based baselines. Its structured exploration enables faster convergence to high-reward assignments, while coordinated task allocation improves efficiency in dynamic and heterogeneous environments. DistMLLM also demonstrates strong scalability and robustness to transmission cost, with its disaggregated design yielding clear performance gains over monolithic execution.

2. Materials and Methods

2.1. Problem Formulation

In this section, we formulate the task allocation problem for MLLM service providers and propose an online algorithm to solve it. We aim to maximize the profit of all MLLM service providers while ensuring fairness and flexibility in task allocation. Table 2 summarizes the key notations used in the problem formulation.
We denote the set of MLLM service providers as $\mathcal{M} = \{1, 2, \dots, M\}$ and the set of edge workers as $\mathcal{N} = \{1, 2, \dots, N\}$. Time is divided into discrete slots $\mathcal{T} = \{1, 2, \dots, T\}$. In our system, $x_{m,t}^{\mathrm{enc}}$ denotes the edge worker selected by MLLM service provider $m$ to run the multimodal encoder at time slot $t$, and $x_{m,t}^{\mathrm{llm}}$ denotes the edge worker selected to run the large language model at time slot $t$.
The latency of the multimodal encoder and of large language model inference performed by edge worker $n$ when chosen by service provider $m$ is denoted by $l_{m,n}^{\mathrm{enc}}$ and $l_{m,n}^{\mathrm{llm}}$, respectively. The corresponding energy consumption is denoted by $e_{m,n}^{\mathrm{enc}}$ and $e_{m,n}^{\mathrm{llm}}$, and the reward obtained by service provider $m$ when choosing edge worker $n$ is denoted by $r_{m,n}^{\mathrm{enc}}$ and $r_{m,n}^{\mathrm{llm}}$. In addition, $c_m$ denotes the computation demand of service provider $m$ in each time slot, and $C_n$ denotes the computing capacity of edge worker $n$ in each time slot.
The latency of the multimodal encoder and of large language model inference depends on the input data size $d_m$ and the device performance $p_n$ of the edge worker. The latency models are given by:
$$l_{m,n}^{\mathrm{enc}} = \psi_n^{\mathrm{enc}}(d_m)\,\beta_n(p_n), \qquad l_{m,n}^{\mathrm{llm}} = \psi_n^{\mathrm{llm}}(d_m)\,\beta_n(p_n),$$
where $\psi_n^{\mathrm{enc}}(d)$ and $\psi_n^{\mathrm{llm}}(d)$ capture the dependence of latency on the input data size for the encoder and the LLM, respectively, and $\beta_n(p_n)$ captures the dependence on device performance. Both $\psi$ functions are concave.
Energy consumption mainly comprises transmission and processing energy. The energy consumption models are given by:
$$e_{m,n}^{\mathrm{enc}} = (\gamma^{\mathrm{enc}} + \mu)\, d_m\, \alpha (p_n)^2, \qquad e_{m,n}^{\mathrm{llm}} = (\gamma^{\mathrm{llm}} + \mu)\, d_m\, \alpha (p_n)^2,$$
where $\alpha (p_n)^2$ represents the device performance factor, and $\gamma^{\mathrm{enc}}$, $\gamma^{\mathrm{llm}}$, and $\mu$ represent the per-bit energy costs of transmitting (encoder and LLM data, respectively) and processing the data.
The reward of service provider $m$ depends on the intrinsic value of the task, the latency of multimodal encoding and LLM inference, and the energy consumption. The reward models are given by:
$$r_{m,n}^{\mathrm{enc}} = V(m) + G_m^{\mathrm{enc}}(l_{m,n}^{\mathrm{enc}}) - \omega_n\, e_{m,n}^{\mathrm{enc}}, \qquad r_{m,n}^{\mathrm{llm}} = V(m) + G_m^{\mathrm{llm}}(l_{m,n}^{\mathrm{llm}}) - \omega_n\, e_{m,n}^{\mathrm{llm}},$$
where $V(m)$ represents the intrinsic value of the task, $G_m^{\mathrm{enc}}(l)$ and $G_m^{\mathrm{llm}}(l)$ are concave reward functions of latency, and $\omega_n$ represents the price charged by edge worker $n$ for each unit of energy consumed.
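To make these models concrete, the sketch below evaluates latency, energy, and reward for a single provider–worker pair. The specific functional forms (a linear $\psi$, $\beta(p) = 1/p$, a linear latency gain $G$) and every numerical constant (psi_coeff, gamma, mu, alpha, g0, g1, omega, task_value) are illustrative assumptions; the formulation above only requires $\psi$ and $G$ to be concave.

```python
def latency(d_m, p_n, psi_coeff=1e-8):
    """l = psi(d) * beta(p): psi(d) = psi_coeff * d (linear, hence concave)
    and beta(p) = 1 / p are illustrative choices."""
    return psi_coeff * d_m / p_n

def energy(d_m, p_n, gamma=1e-6, mu=2e-6, alpha=0.5):
    """e = (gamma + mu) * d_m * alpha * p_n**2: per-bit transmission (gamma)
    and processing (mu) energy scaled by the device performance factor."""
    return (gamma + mu) * d_m * alpha * p_n ** 2

def reward(d_m, p_n, task_value=10.0, g0=5.0, g1=0.5, omega=0.01):
    """r = V(m) + G(l) - omega * e, with G(l) = g0 - g1 * l, a linear
    (hence concave) latency gain chosen purely for illustration."""
    return task_value + (g0 - g1 * latency(d_m, p_n)) - omega * energy(d_m, p_n)

# Example: a 5.4 MB input (in bits) evaluated on two workers, p = 0.8 and 1.0.
d = 5.4 * 8e6
for p in (0.8, 1.0):
    print(f"p={p}: latency={latency(d, p):.3f} s, reward={reward(d, p):.2f}")
```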
The optimization objective is to maximize the profit of all MLLM service providers, which can be formulated as:
$$\mathbf{P}: \quad \max_{x_t^{\mathrm{enc}},\, x_t^{\mathrm{llm}}} \ \sum_{t \in \mathcal{T}} \sum_{m \in \mathcal{M}} \mathbb{E}\big[ r_{m, x_{m,t}^{\mathrm{enc}}}^{\mathrm{enc}}(t) \big] + \mathbb{E}\big[ r_{m, x_{m,t}^{\mathrm{llm}}}^{\mathrm{llm}}(t) \big],$$
subject to:
$$\sum_{m \in \mathcal{M}_{n,t}} c_m \le C_n, \quad \forall n \in \mathcal{N},\ \forall t \in \mathcal{T},$$
where $\mathcal{M}_{n,t}$ denotes the set of providers assigned to edge worker $n$ at time slot $t$. In time slot $t$, a decision $x_t^* = (x_{m,t}^*)$ satisfies proportional fairness if, for any other feasible decision $x_t = (x_{m,t})$, the following inequality holds:
$$\sum_{m \in \mathcal{M}} \frac{r_{m, x_{m,t}} - r_{m, x_{m,t}^*}}{r_{m, x_{m,t}^*}} \le 0.$$

2.2. Algorithm Design

To handle the stochastic variation in rewards, we design an online algorithm based on a multi-agent bandit approach. The algorithm explores the potential rewards of different edge workers and exploits the best-known configurations.
The proposed algorithm follows these detailed steps: First, initialize the performance data for each edge worker and set the parameters for exploration and exploitation. In the exploration phase, each MLLM service provider randomly allocates both the multimodal encoder and the large language model tasks to different edge workers for a predefined number of rounds. The performance estimates are updated based on the observed latency and energy consumption. Next, the multi-knapsack problem solver is called to determine the optimal allocation mapping in the exploitation phase. Tasks are dispatched based on this optimal allocation mapping for a predefined number of exploitation rounds. Throughout this phase, the performance estimates are continuously updated to adapt to any changes in edge worker performance.

2.3. Mapping upon Multi-Knapsack Problem

To obtain the optimal task allocation mapping $\Pi$, we model the allocation in each time slot $t$ as a multi-knapsack problem. The objective is to maximize the utility function, which combines the performance of the multimodal encoder and the large language model tasks:
$$\max_{x_t} \ \sum_{m \in \mathcal{M}} \Big( u_{\tau, m, x_{m,t}^{\mathrm{enc}}}^{\mathrm{enc}} + u_{\tau, m, x_{m,t}^{\mathrm{llm}}}^{\mathrm{llm}} \Big),$$
subject to:
$$\sum_{m \in \mathcal{M}_{n,t}} c_m \le C_n, \quad \forall n \in \mathcal{N},$$
where $u_{\tau,m,n}^{\mathrm{enc}}$ and $u_{\tau,m,n}^{\mathrm{llm}}$ denote the utility estimates maintained at epoch $\tau$.
The mapping algorithm is designed to solve this multi-knapsack problem iteratively, updating the allocation mapping Π based on the observed performance.
The regret in our algorithm is defined as the difference between the reward achieved by the optimal allocation and the reward obtained by our algorithm. The proposed algorithm balances exploration and exploitation effectively to minimize regret. The regret can be quantified as follows:
$$R(T) = \sum_{t=1}^{T} \sum_{m \in \mathcal{M}} \Big( r_{m, x_{m,t}^*} - r_{m, x_{m,t}} \Big),$$
where $r_{m, x_{m,t}^*}$ is the reward of the optimal allocation and $r_{m, x_{m,t}}$ is the reward achieved by our algorithm at time $t$. The algorithm aims to reduce this regret over time by alternating between exploration and exploitation.
Exploration and exploitation are the two critical components of our algorithm. During exploration, MLLM service providers gather information about the performance of various edge workers by randomly allocating tasks. This phase is essential for building an accurate model of the edge workers' capabilities. In each epoch, the exploration phase uses a fixed exploration length $T_{\mathrm{explore}}$ to gather sufficient data. The estimated rewards $\hat{u}_{m,n}^{\mathrm{enc}}$ and $\hat{u}_{m,n}^{\mathrm{llm}}$ are initialized to 0 and updated as:
$$\hat{u}_{m,n}^{\mathrm{enc}} \leftarrow \frac{\hat{u}_{m,n}^{\mathrm{enc}} \times (\tau - 1) + u_{m,n}^{\mathrm{enc}}}{\tau}, \qquad \hat{u}_{m,n}^{\mathrm{llm}} \leftarrow \frac{\hat{u}_{m,n}^{\mathrm{llm}} \times (\tau - 1) + u_{m,n}^{\mathrm{llm}}}{\tau},$$
where $\tau$ is the current epoch. In the exploitation phase, the algorithm uses the gathered information to make informed decisions, allocating tasks to the best-performing edge workers to maximize rewards. By alternating between exploration and exploitation, the algorithm remains adaptive and continues to optimize task allocations as conditions change.

2.4. Algorithm Analysis

This section analyzes the proposed online task allocation algorithm. We provide high-probability estimation accuracy bounds, a robustness condition under which the mapping is invariant to estimation error, a characterization of the multi-knapsack subroutine based on branch-and-bound (B&B), and a logarithmic regret bound.
Unless otherwise stated, rewards are assumed to be uniformly bounded. For each provider–worker pair $(m, n)$, let $u_{m,n}^{\mathrm{enc}} \in [u_{\min}, u_{\max}]$, where $u_{\max} := \max_{m,n} u_{m,n}^{\mathrm{enc}}$ and $u_{\min} := \min_{m,n} u_{m,n}^{\mathrm{enc}}$ denote the maximal and minimal feasible utility values across all assignments. Thus,
$$u_{\min} \le u_{m,n}^{\mathrm{enc}} \le u_{\max}, \qquad R := u_{\max} - u_{\min} < \infty,$$
capacities satisfy $C_n > 0$ for each worker $n \in \mathcal{N}$, and item costs satisfy $c_m \ge c_{\min} > 0$ for each service provider $m \in \mathcal{M}$. We denote
$$N_{\max} := \max_{n \in \mathcal{N}} \frac{C_n}{c_{\min}}, \qquad B := \sum_{n \in \mathcal{N}} \frac{C_n}{c_{\min}} \ \ (\le N N_{\max}).$$
Algorithm 1 proceeds in epochs $\tau = 1, 2, \dots$. Each epoch consists of (i) an exploration phase of $T_{\mathrm{explore}}$ slots and (ii) an exploitation phase of length $2^{\tau}$ in which Algorithm 2 (the multi-knapsack subroutine) is invoked with the current estimates $\{\hat{u}_{m,n}^{\mathrm{enc}}\}$. After $\tau$ epochs, each pair $(m, n)$ has accumulated at least $\tau$ i.i.d. reward samples. The LLM side is symmetric and omitted for brevity.
Algorithm 1 Task Allocation with Multi-Agent Bandit
Input: exploration length $T_{\mathrm{explore}}$, exploitation length $T_{\mathrm{exploit}} = 2^{\tau}$
Initialize $u_{\tau,m,n}^{\mathrm{enc}} \leftarrow 0$, $u_{\tau,m,n}^{\mathrm{llm}} \leftarrow 0$, $\forall m \in \mathcal{M}, \forall n \in \mathcal{N}$
for $\tau = 1$ to $\tau_T$ do
    for $t = 1$ to $T_{\mathrm{explore}}$ do
        for each service provider $m \in \mathcal{M}$ do
            Provider $m$ selects a worker $n \in \mathcal{N}$ in round-robin order and sends its encoder and LLM subtasks, observing rewards $u_{m,n}^{\mathrm{enc}}(\tau)$ and $u_{m,n}^{\mathrm{llm}}(\tau)$
            Update $u_{\tau,m,n}^{\mathrm{enc}} \leftarrow \big( u_{\tau-1,m,n}^{\mathrm{enc}} \times (\tau-1) + u_{m,n}^{\mathrm{enc}}(\tau) \big) / \tau$
            Update $u_{\tau,m,n}^{\mathrm{llm}} \leftarrow \big( u_{\tau-1,m,n}^{\mathrm{llm}} \times (\tau-1) + u_{m,n}^{\mathrm{llm}}(\tau) \big) / \tau$
        end for
    end for
    Call Algorithm 2 with input $\{u_{\tau,m,n}^{\mathrm{enc}}, u_{\tau,m,n}^{\mathrm{llm}}\}$ and $\Pi = \mathbf{0}$
    for $t = 1$ to $T_{\mathrm{exploit}}$ do
        Service providers dispatch MLLM inference tasks based on $\Pi$
    end for
end for
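For readability, the following is a minimal Python sketch of Algorithm 1's epoch-based explore/exploit schedule. The reward oracle observe_reward(m, n) is an assumed interface supplied by the environment, and the call to Algorithm 2 is replaced by a capacity-unaware greedy placeholder, so this only illustrates the schedule and the running-average update, not the full method.

```python
import numpy as np

def run_task_allocation(observe_reward, M, N, T):
    """Sketch of Algorithm 1: epoch-based exploration and exploitation.
    observe_reward(m, n) -> (r_enc, r_llm) is assumed to be provided by the environment."""
    u_enc = np.zeros((M, N))   # running-average utility estimates (encoder tasks)
    u_llm = np.zeros((M, N))   # running-average utility estimates (LLM tasks)
    pi = np.zeros(M, dtype=int)
    t, tau = 0, 0
    while t < T:
        tau += 1
        # Exploration: each provider samples every worker once in this epoch and
        # updates its estimates with the running average over the tau epochs so far.
        for n in range(N):
            for m in range(M):
                r_enc, r_llm = observe_reward(m, n)
                u_enc[m, n] = (u_enc[m, n] * (tau - 1) + r_enc) / tau
                u_llm[m, n] = (u_llm[m, n] * (tau - 1) + r_llm) / tau
            t += 1
        # Mapping: placeholder for Algorithm 2 (ignores worker capacities).
        pi = np.argmax(u_enc + u_llm, axis=1)
        # Exploitation: dispatch according to the mapping for 2**tau slots.
        for _ in range(2 ** tau):
            for m in range(M):
                observe_reward(m, pi[m])
            t += 1
    return pi, u_enc, u_llm
```

Any callable that returns an (encoder, LLM) reward pair can be plugged in as observe_reward to run the loop end to end.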
Algorithm 2 Multi-Knapsack Mapping
Input: $\{u_{\tau,m,n}^{\mathrm{enc}}, u_{\tau,m,n}^{\mathrm{llm}}\}$, $\Pi = \mathbf{0}$
for each edge worker $n \in \mathcal{N}$ do
    for each service provider $m \in \mathcal{M}$ do
        if $\Pi_m \ne 0$ then
            $\Delta u_{\tau,m,n}^{\mathrm{enc}} \leftarrow u_{\tau,m,n}^{\mathrm{enc}} - u_{\tau,m,\Pi_m}^{\mathrm{enc}}$
            $\Delta u_{\tau,m,n}^{\mathrm{llm}} \leftarrow u_{\tau,m,n}^{\mathrm{llm}} - u_{\tau,m,\Pi_m}^{\mathrm{llm}}$
        else
            $\Delta u_{\tau,m,n}^{\mathrm{enc}} \leftarrow u_{\tau,m,n}^{\mathrm{enc}}$
            $\Delta u_{\tau,m,n}^{\mathrm{llm}} \leftarrow u_{\tau,m,n}^{\mathrm{llm}}$
        end if
    end for
    Solve the knapsack sub-problem for worker $n$ with $\Delta u_{\tau,m,n}^{\mathrm{enc}}$ and $\Delta u_{\tau,m,n}^{\mathrm{llm}}$
    for each service provider $m \in \mathcal{M}$ do
        if worker $n$ accepts the task from provider $m$ then
            Update $\Pi_m \leftarrow n$
        end if
    end for
end for
Output: $\Pi$
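The sketch below mirrors Algorithm 2's outer loop: for each worker it computes the marginal gain of reassigning each provider and then updates the mapping. To keep it self-contained, it assumes unit computation demands ($c_m = 1$) and integer capacities, under which each per-worker knapsack sub-problem reduces to picking the providers with the largest positive marginal gains; the general case requires a full knapsack solver such as the branch-and-bound routine discussed in Section 2.4.

```python
import numpy as np

def multi_knapsack_mapping(u_enc, u_llm, capacities):
    """Sketch of Algorithm 2 assuming unit demands (c_m = 1) and integer capacities.
    u_enc, u_llm: (M, N) utility estimates; capacities: length-N integer array.
    Returns pi, where pi[m] is the worker assigned to provider m (-1 if unassigned)."""
    M, N = u_enc.shape
    u = u_enc + u_llm
    pi = np.full(M, -1)
    for n in range(N):
        # Marginal gain of moving each provider m from its current worker to n.
        gain = np.array([u[m, n] - (u[m, pi[m]] if pi[m] >= 0 else 0.0) for m in range(M)])
        # Unit-demand knapsack: accept the top-capacity providers with positive gain.
        order = np.argsort(-gain)
        for m in order[: capacities[n]]:
            if gain[m] > 0:
                pi[m] = n
    return pi

# Toy example: 4 providers, 2 workers with capacity 2 each.
u_enc = np.array([[0.6, 0.9], [0.8, 0.4], [0.5, 0.7], [0.3, 0.8]])
u_llm = np.array([[1.2, 1.5], [1.4, 0.9], [1.0, 1.3], [0.7, 1.6]])
print(multi_knapsack_mapping(u_enc, u_llm, capacities=np.array([2, 2])))  # -> [0 0 1 1]
```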
Lemma 1.
After $\tau$ independent samples for pair $(m, n)$, for any $\epsilon > 0$,
$$\Pr\big( |\hat{u}_{m,n}^{\mathrm{enc}} - u_{m,n}^{\mathrm{enc}}| > \epsilon \big) \le 2 \exp\!\left( -\frac{2 \tau \epsilon^2}{R^2} \right).$$
Consequently, for the collection of all $M \times N$ pairs,
$$\Pr\big( \exists (m, n) : |\hat{u}_{m,n}^{\mathrm{enc}} - u_{m,n}^{\mathrm{enc}}| > \epsilon \big) \le 2 M N \exp\!\left( -\frac{2 \tau \epsilon^2}{R^2} \right).$$
Proof. 
Apply Hoeffding's inequality to the $\tau$ i.i.d. samples, each supported on an interval of width $R$. A union bound over the $M \times N$ pairs yields the second inequality. □
Remark 1.
For any target failure level $\eta_\tau \in (0, 1)$ at epoch $\tau$, one can choose the number of exploration samples per pair $(m, n)$ large enough so that
$$2 M N \exp\!\left( -\frac{2 \tau \epsilon_\tau^2}{R^2} \right) \le \eta_\tau,$$
where $\epsilon_\tau$ is the desired accuracy in epoch $\tau$. In practice, this can be enforced by letting each pair $(m, n)$ obtain at least one additional sample per epoch, so that after $\tau$ epochs each pair has at least $\tau$ samples, and by decreasing $\epsilon_\tau$ as $\tau$ increases.
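Rearranging this condition gives an explicit lower bound on the number of samples required per pair; the numerical values below are purely illustrative.
$$\tau \;\ge\; \frac{R^{2}}{2\,\epsilon_\tau^{2}} \ln\!\left(\frac{2 M N}{\eta_\tau}\right).$$
For instance, with $M = 5$, $N = 10$, $R = 1$, $\epsilon_\tau = 0.1$, and $\eta_\tau = 0.01$, this requires $\tau \ge 50 \ln(10^{4}) \approx 461$ samples for each provider–worker pair.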
Lemma 2.
Let $\Pi^\star$ be the optimal mapping under the true rewards, with total value $V(\Pi^\star)$. Define the optimality gap
$$\Delta_{\mathrm{gap}} := \min_{\Pi \ne \Pi^\star} \big( V(\Pi^\star) - V(\Pi) \big) \ (> 0).$$
If $\max_{m,n} |\hat{u}_{m,n}^{\mathrm{enc}} - u_{m,n}^{\mathrm{enc}}| \le \varepsilon$ and $\varepsilon \le \Delta_{\mathrm{gap}} / (2B)$, then the mapping $\hat{\Pi}$ derived from $\{\hat{u}_{m,n}^{\mathrm{enc}}\}$ coincides with $\Pi^\star$.
Proof. 
Any feasible mapping contains at most $B$ assignments. Entrywise perturbations bounded by $\varepsilon$ change any mapping's total value by at most $B\varepsilon$. Hence the estimated advantage of $\Pi^\star$ over any $\Pi \ne \Pi^\star$ is at least $\Delta_{\mathrm{gap}} - 2B\varepsilon \ge 0$, preserving optimality. □
We instantiate Algorithm 2 as a B&B solver for each multi-knapsack subproblem constructed from the current estimates $\{\hat{u}_{m,n}^{\mathrm{enc}}\}$. When fully executed, B&B is exact; with time or node limits, it behaves as an anytime heuristic with a certified optimality gap.
Lemma 3.
When executed to completion (without time/node limits), the B&B solver returns the global optimum of the multi-knapsack subproblem. Its worst-case running time is exponential. Under time- or node-limited execution, B&B becomes an anytime heuristic that returns a feasible incumbent along with a valid upper bound, thereby certifying an optimality gap at termination; no fixed constant-factor approximation ratio is guaranteed.
Proof. 
When executed without time or node limits, B&B explores a search tree where each node carries a valid upper bound and the algorithm maintains a feasible incumbent solution. Since pruning only removes nodes whose upper bound is no larger than the incumbent, full exploration guarantees that no optimal solution is discarded. Hence, B&B returns the exact global optimum when allowed to run to completion, although its worst-case complexity is exponential due to the NP-hardness of the multi-knapsack problem.
When terminated early, B&B functions as an anytime heuristic. At termination, it retains a feasible incumbent of value L and a valid upper bound U, and therefore provides a certified optimality gap U L . Because the incumbent may lie anywhere in the search tree, no instance-independent constant-factor approximation ratio can be guaranteed under arbitrary time limits. □
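To illustrate the incumbent/upper-bound mechanism described in Lemma 3, the following is a minimal branch-and-bound solver for a single 0/1 knapsack, using the LP-relaxation (fractional) bound for pruning. Treating each worker's sub-problem in Algorithm 2 as such a knapsack, and the item values and weights shown here, are assumptions of this sketch.

```python
from dataclasses import dataclass

@dataclass
class Item:
    value: float   # marginal utility of accepting a provider's task
    weight: float  # computation demand c_m

def fractional_bound(items, capacity, start, current_value):
    """Optimistic upper bound: fill the remaining capacity greedily by value density,
    allowing the last item to be taken fractionally (LP relaxation)."""
    bound, cap = current_value, capacity
    for it in items[start:]:
        if it.weight <= cap:
            bound += it.value
            cap -= it.weight
        else:
            bound += it.value * (cap / it.weight)
            break
    return bound

def knapsack_branch_and_bound(items, capacity):
    """Exact 0/1 knapsack via depth-first branch-and-bound.
    Returns (best value, indices of chosen items)."""
    order = sorted(range(len(items)), key=lambda i: items[i].value / items[i].weight, reverse=True)
    sorted_items = [items[i] for i in order]
    best_value, best_set = 0.0, []

    def dfs(idx, value, cap, chosen):
        nonlocal best_value, best_set
        if value > best_value:                     # new incumbent found
            best_value, best_set = value, chosen[:]
        if idx == len(sorted_items):
            return
        if fractional_bound(sorted_items, cap, idx, value) <= best_value:
            return                                 # prune: bound cannot beat incumbent
        it = sorted_items[idx]
        if it.weight <= cap:                       # branch 1: take item idx
            chosen.append(order[idx])
            dfs(idx + 1, value + it.value, cap - it.weight, chosen)
            chosen.pop()
        dfs(idx + 1, value, cap, chosen)           # branch 2: skip item idx

    dfs(0, 0.0, capacity, [])
    return best_value, best_set

# Toy example: one worker with capacity 10 choosing among 4 providers' tasks.
items = [Item(6.0, 4.0), Item(5.0, 3.0), Item(8.0, 6.0), Item(3.0, 2.0)]
print(knapsack_branch_and_bound(items, 10.0))  # -> (14.0, [1, 0, 3])
```

Stopping the depth-first search early (e.g., after a node budget) would leave best_value as the incumbent $L$ and the largest unexplored bound as $U$, giving the certified gap $U - L$ mentioned in the lemma.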
Theorem 1.
Let Algorithm 1 run for $T$ slots with the epoch-based schedule described above and with Algorithm 2 instantiated by the B&B solver as in Lemma 3. Then, for the ENC pipeline, the cumulative regret is bounded by
$$R_{\mathrm{ENC}}(T) \le (T_{\mathrm{explore}} + M)\, N\, u_{\max} \log_2(T + 2) + C_0 = O(\log T),$$
where $C_0 = O(N^2 M u_{\max})$ is a constant independent of $T$. An identical bound holds for the LLM pipeline, i.e., $R_{\mathrm{LLM}}(T) = O(\log T)$.
Proof. 
Let $\tau_T$ be the last completed epoch by time $T$. Since the exploitation length of the $\tau$-th epoch equals $2^{\tau}$, we have
$$T \ge \sum_{\tau=1}^{\tau_T} 2^{\tau} = 2\,(2^{\tau_T} - 1) \ \Longrightarrow\ \tau_T \le \log_2(T + 2).$$
We decompose the regret into three components.
(i) Exploration. During exploration, each provider collects $T_{\mathrm{explore}}$ samples per epoch, and the reward is at most $u_{\max}$ for each provider–worker pair. Thus, the regret contributed during exploration is at most $(T_{\mathrm{explore}}\, N\, u_{\max})\, \tau_T$.
(ii) Solver suboptimality. When B&B is run without time limits, the multi-knapsack subproblem is solved optimally and this term becomes zero. Under practical time budgets, B&B returns a feasible incumbent with a certified optimality gap, and the loss per epoch remains uniformly bounded. Therefore, the cumulative contribution of this term is absorbed into the constant $C_0$.
(iii) Estimation errors. Let $E_\tau$ denote the event that the uniform estimation bound in Lemma 1 fails at epoch $\tau$. The sampling schedule ensures $\Pr(E_\tau) \le \eta_\tau$, where $\eta_\tau$ decays exponentially. The worst-case regret in epoch $\tau$ is $O(2^{\tau} N M u_{\max})$, and the expected regret contribution is thus $O(2^{\tau} N M u_{\max}\, \eta_\tau)$. Selecting $\eta_\tau \le 2^{-2\tau}$ ensures that $\sum_{\tau \ge 1} 2^{\tau} \eta_\tau$ converges, introducing only the constant term $C_0$.
Combining the above three parts and using $\tau_T \le \log_2(T + 2)$ yields the claimed $O(\log T)$ bound. □
Remark 2
(On the exploration length $T_{\mathrm{explore}}$). The length of the exploration phase arises from the requirement that all $NM$ provider–worker utilities be estimated with uniform accuracy at the beginning of each epoch. According to Lemma 1, obtaining estimates within accuracy $\epsilon_\tau$ and failure probability at most $\eta_\tau$ requires each provider–worker pair to be sampled a sufficient number of times. This naturally introduces factors involving $N$, $M$, and $u_{\max}$ in $T_{\mathrm{explore}}$. These constants are a consequence of standard concentration inequalities and do not reflect computational overhead; hence, they do not affect the real-time execution or scalability of the algorithm.

3. Results

In this section, we first present a set of preliminary device-level experiments, followed by extensive simulations that demonstrate the effectiveness of the proposed algorithm compared with existing alternatives.

3.1. Preliminary Experiment

We first conduct a preliminary device-level experiment that characterizes both the communication cost of encoder outputs and the latency composition of heterogeneous execution.
We begin by assessing the communication benefit brought by multimodal encoders. A 4K-resolution (3840 × 2160) PNG image of approximately 5.4 MB is used as the representative input. As shown in Figure 3a, the output tensors generated by several visual encoders are significantly smaller than the original image, with lightweight backbones such as EfficientNet and ResNet50 achieving more than an order-of-magnitude reduction.
Next, we analyze the end-to-end latency using two computing platforms: a Jetson AGX Orin (64 GB) representing the low-power edge device, and a remote NVIDIA A100 (NVIDIA, Santa Clara, CA, USA) representing a high-performance edge device. We adopt the LLaVA-1.5-7B model as a representative MLLM and measure the latency contributions of multimodal encoding, LLM inference, and data transmission. Figure 3b illustrates the latency decomposition under the two execution settings. For the Jetson AGX Orin, the majority of the delay originates from LLM processing due to the limited computational capability of embedded hardware. In contrast, remote execution significantly accelerates LLM inference on the A100, but this advantage is offset by the substantial transmission delay incurred by sending large raw inputs to the server.
Overall, these results reveal a clear performance trade-off within heterogeneous edge environments. Lightweight edge devices can efficiently perform multimodal encoding and produce compact feature tensors, but they are unable to execute LLM inference with acceptable latency due to limited computational capability. Conversely, although powerful edge nodes can perform LLM inference with low latency, such nodes are typically scarce and often geographically distant from end devices, making it inefficient to transmit large raw inputs directly to them.

3.2. Simulation Experiment Settings

Building on the above device-level observations, we now conduct large-scale simulations to systematically evaluate how task allocation strategies behave under diverse provider and worker configurations. We use synthetic datasets for providers to simulate large language model and encoder tasks dispatched to workers. Each worker is modeled with a random performance metric drawn from a uniform distribution between 0.8 and 1.0. The number of time slots for the simulation is fixed at 500. To ensure statistical robustness, each simulation scenario is repeated 20 times, and we report the mean performance along with its standard deviation. We test different system scales by varying both the number of providers and the number of workers. Specifically, the number of providers is set to {5, 6, 7}, while the number of workers is varied as {5, 10, 20, 30}.
We compare our DistMLLM algorithm with two baselines, as sketched below. HEU (heuristic baseline) is a simple two-tier heuristic that assigns ENC tasks to weaker workers and LLM tasks to stronger workers based on their averaged performance ranks. UCB (multi-agent UCB) [17] is a decentralized UCB-based method in which each provider independently runs a UCB policy for both ENC and LLM subtasks.
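For concreteness, the following is a minimal sketch of the simulation setup and the decentralized multi-agent UCB baseline under the stated assumptions (worker performance drawn from U(0.8, 1.0), 500 time slots). The reward function here is a simple placeholder rather than the profit model of Section 2.1, and the variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
M, N, T = 5, 10, 500                        # providers, workers, time slots
perf = rng.uniform(0.8, 1.0, size=N)        # per-worker performance metric

def reward(worker):
    """Placeholder stochastic reward; the paper's profit model additionally
    accounts for task value, latency, and energy payments."""
    return perf[worker] + rng.normal(0.0, 0.05)

# Decentralized UCB1 baseline [17]: each provider runs an independent UCB policy.
counts = np.zeros((M, N))
means = np.zeros((M, N))
total_profit = 0.0
for t in range(1, T + 1):
    for m in range(M):
        if t <= N:                           # play every arm once first
            n = t - 1
        else:
            ucb = means[m] + np.sqrt(2.0 * np.log(t) / counts[m])
            n = int(np.argmax(ucb))
        r = reward(n)
        counts[m, n] += 1
        means[m, n] += (r - means[m, n]) / counts[m, n]
        total_profit += r
print(f"average profit per slot: {total_profit / T:.3f}")
```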

3.3. Profit and Regret Analysis

Comparing LLM Profit: We first examine the averaged profit of LLM tasks, as shown in Figure 4a. At the beginning of the simulation (approximately the first 100 time slots), DistMLLM undergoes an explicit exploration phase, resulting in lower LLM profit compared with HEU. Since HEU follows a fixed assignment rule without any exploration cost, it initially achieves the highest profit among the three algorithms. However, once DistMLLM completes its exploration stage and enters stable exploitation, it quickly converges to a superior assignment strategy. After this transition, DistMLLM consistently outperforms both HEU and UCB for the remainder of the simulation, maintaining the highest time-averaged LLM profit.
The HEU baseline attains moderate performance by assigning LLM tasks to stronger workers based on averaged performance ranks. However, its static allocation prevents it from adapting to dynamic performance variations across workers. The UCB algorithm yields the lowest profit, as its decentralized exploration strategy causes each provider to independently explore the worker space, incurring substantial exploration overhead. These results demonstrate that DistMLLM more effectively leverages cross-layer cooperation to optimize LLM execution.
We next examine the time-averaged profit of ENC tasks, as shown in Figure 4b. Although the absolute profit scale of ENC tasks is lower than that of LLM tasks, the three algorithms still exhibit a clear performance hierarchy. DistMLLM consistently achieves the highest ENC profit at all observed time points and continues to improve as the simulation progresses, indicating that its coordinated cross-layer scheduling also benefits the encoder side. UCB attains the second-best performance, with its profit steadily increasing but remaining below that of DistMLLM. In contrast, the HEU baseline yields the lowest ENC profit and remains almost flat over time, because its static split of workers into ENC tier and LLM tier cannot fully exploit worker heterogeneity or adapt to dynamic performance fluctuations. These results show that, even for relatively lightweight ENC tasks, adaptive multi-agent learning still brings a noticeable advantage over fixed heuristic assignment.
Comparing LLM and ENC Regret: We further examine the cumulative regret of the three algorithms for both LLM and ENC tasks, as shown in Figure 4c,d. Since regret characterizes the performance loss with respect to the oracle optimum, lower curves indicate more efficient online learning.
For LLM tasks, HEU suffers from the highest cumulative regret with a consistently steep growth rate, reflecting the persistent suboptimality of its static worker assignment. During the early and mid stages (roughly the first 400 time slots), DistMLLM incurs slightly higher regret than UCB due to its coordinated exploration. However, as learning progresses, DistMLLM gradually reduces its regret growth rate and eventually surpasses UCB, ending the simulation with the lowest overall LLM regret.
For ENC tasks, HEU again exhibits the largest cumulative regret. In the earlier stage of the simulation, UCB maintains the lowest regret among the three algorithms, slightly lower than DistMLLM. As time progresses, however, DistMLLM continues to refine its ENC–LLM joint assignment strategy and ultimately overtakes UCB in the final portion of the horizon, finishing with the lowest ENC regret overall. This demonstrates that although ENC tasks are lighter and more stable, cross-layer coordination still provides long-term advantages, allowing DistMLLM to achieve superior regret performance for both LLM and ENC tasks.
Provider-Level Performance under DistMLLM: Finally, Figure 5 illustrates the time-averaged ENC and LLM profit of the five providers under the DistMLLM algorithm. At the beginning of the simulation, DistMLLM conducts explicit exploration by cycling each provider across different workers. Because these assignments are not guided by reward estimates, the provider-level profits exhibit noticeable fluctuations.
After exploration concludes, the algorithm switches to exploitation. It constructs an allocation mapping using the performance estimates gathered so far and maintains this mapping for an extended period. From this point onward, the profit curves stabilize and gradually converge. Providers that consistently match well with high-performing workers achieve higher long-term averages, while others converge to lower levels based on their estimated utility profiles.
A clear contrast emerges between ENC and LLM results. ENC profits show relatively small differences among providers, which is consistent with their lower task value and weaker sensitivity to worker heterogeneity. In comparison, LLM profits exhibit a wider spread because LLM tasks rely more heavily on worker performance and carry higher intrinsic value. As a result, variations in worker suitability translate more strongly into provider-level differences during convergence.
Across all experiments in this section, DistMLLM demonstrates clear and consistent advantages over both HEU and UCB in terms of profit and regret. For LLM and ENC tasks alike, DistMLLM quickly transitions from exploration to stable exploitation and achieves the highest long-term averaged profit. The regret results further reinforce this observation: although DistMLLM may incur slightly higher regret in the early stage due to its structured exploration, it ultimately surpasses UCB and ends with the lowest cumulative regret in both task types.

3.4. Sensitivity Analysis

Beyond the main comparison with baseline algorithms, we conduct a set of ablation and sensitivity experiments to further examine how different system factors influence the performance of DistMLLM. These experiments evaluate (i) scalability with respect to the number of service providers and workers, (ii) the benefit brought by disaggregation, and (iii) the impact of tensor transmission cost. The results are summarized in Figure 6 and Figure 7.
Impact of the number of providers and workers.   Figure 6a shows the averaged profit as the number of service providers increases from {5, 6, 7, 8, 9, 10}, while the number of workers is fixed at 5. Across all settings, DistMLLM consistently outperforms the decentralized multi-agent UCB baseline. As the number of providers increases under this fixed worker budget, the system becomes progressively congested, and the independent exploration conducted by UCB leads to more frequent collisions and inefficient worker utilization. In contrast, DistMLLM leverages its coordinated exploration–exploitation mechanism to mitigate contention and maintain more efficient worker allocation, resulting in significantly higher total profit and demonstrating strong scalability with respect to provider population.
Figure 6b shows the averaged profit as the number of workers varies over {5, 10, 20, 30}, while the number of service providers is fixed at 10. As more workers become available, both algorithms initially benefit from the increased computational capacity; however, DistMLLM exhibits a noticeably larger performance improvement. When the worker pool is small, contention among providers is high, and the decentralized exploration conducted by the multi-agent UCB baseline leads to frequent conflicts and inefficient worker allocation.
A notable observation is that, after reaching the peak at around 10 workers, the performance of both algorithms slightly decreases as the worker pool continues to expand. This decline occurs because excessively large worker pools dilute the effective exploration signals: each provider requires more time to sample all available workers, causing slower convergence to high-reward assignments. Additionally, when many low-quality workers are introduced, the probability of sampling suboptimal workers increases, temporarily dragging down the averaged total profit. Despite this downward trend at large worker counts, DistMLLM remains consistently superior to UCB.
Impact of tensor transmission cost and disaggregation.   Figure 7a evaluates how varying the communication cost between encoder and LLM components affects performance. As the transmission cost increases, the profit of both algorithms decreases, which is expected since offloading encoders and LLMs to different workers becomes less attractive. However, DistMLLM exhibits a much slower degradation curve, maintaining a consistent advantage over the UCB baseline. This demonstrates that the disaggregated decision mechanism of DistMLLM is more robust to communication overhead.
In addition, to isolate the benefit of MLLM disaggregation, Figure 7b compares DistMLLM with a monolithic version in which each provider must assign both the encoder and LLM to the same worker. The monolithic design initially achieves a stable profit but is unable to fully exploit worker heterogeneity. DistMLLM, after completing its exploration phase, rapidly converges to higher profit by independently placing the encoder and LLM on different workers when advantageous. This illustrates that decoupling the two components provides clear performance gains—especially in heterogeneous environments where worker strengths differ significantly across encoder and LLM tasks.
Overall, these results highlight the importance of coordinated exploration and joint encoder–LLM allocation in heterogeneous edge environments and confirm that disaggregation is a key enabler of efficient MLLM inference on the edge.

4. Conclusions

In conclusion, this work presents DistMLLM, a framework for deploying multimodal large language model services in heterogeneous edge environments through a disaggregated execution strategy. First, by separating multimodal data encoding and LLM inference across lightweight and high-performance edge devices, DistMLLM leverages the complementary hardware capabilities of heterogeneous platforms; the use of compact tensor representations significantly reduces end-to-end latency and transmission overhead, enabling realistic deployment on resource-constrained edge nodes. Second, to address dynamic workloads and competing service-provider interests, DistMLLM integrates a multi-agent bandit algorithm that continuously estimates device utilities and jointly allocates encoder and LLM tasks across workers. Third, extensive experiments combining realistic device-level measurements with large-scale simulations demonstrate that DistMLLM achieves higher long-term profit, lower cumulative regret, and more stable performance than representative heuristic and multi-agent UCB baselines, while showing strong robustness to variations in communication costs and heterogeneous device capabilities. These results highlight the promise of disaggregated execution coupled with adaptive bandit-based scheduling for practical MLLM deployment in real-world edge computing systems.

Author Contributions

Conceptualization, X.Y.; literature review, X.Y. and H.C.; methodology, X.Y., H.C., L.L. and H.L.; formal analysis, X.Y., H.C., L.L. and H.L.; writing—original draft preparation, X.Y., H.C., L.L. and H.L.; visualization, X.Y. and H.C. All authors have read and agreed to the published version of the manuscript.

Funding

This work was partially supported by JSPS KAKENHI Grant Number JP23K11063 from the Japan Society for the Promotion of Science.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data presented in this article are available upon request from the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Brown, T.B.; Mann, B.; Ryder, N.; Subbiah, M.; Kaplan, J.; Dhariwal, P.; Neelakantan, A.; Shyam, P.; Sastry, G.; Askell, A.; et al. Language models are few-shot learners. In Proceedings of the 34th International Conference on Neural Information Processing Systems (NeurIPS 2020), Vancouver, BC, Canada, 6–12 December 2020; pp. 1877–1901. [Google Scholar]
  2. Chowdhery, A.; Narang, S.; Devlin, J.; Bosma, M.; Mishra, G.; Roberts, A.; Barham, P.; Chung, H.W.; Sutton, C.; Gehrmann, S.; et al. PaLM: Scaling language modeling with pathways. J. Mach. Learn. Res. 2023, 24, 240. [Google Scholar]
  3. Touvron, H.; Lavril, T.; Izacard, G.; Martinet, X.; Lachaux, M.-A.; Lacroix, T.; Rozière, B.; Goyal, N.; Hambro, E.; Azhar, F.; et al. LLaMA: Open and efficient foundation language models. arXiv 2023, arXiv:2302.13971. [Google Scholar] [CrossRef]
  4. Meskó, B. The impact of multimodal large language models on health care’s future. J. Med. Internet Res. 2023, 25, e52865. [Google Scholar] [CrossRef] [PubMed]
  5. Cui, C.; Ma, Y.; Cao, X.; Ye, W.; Zhou, Y.; Liang, K.; Chen, J.; Lu, J.; Yang, Z.; Liao, K.-D.; et al. A survey on multimodal large language models for autonomous driving. In Proceedings of the 2024 IEEE/CVF Winter Conference on Applications of Computer Vision Workshops (WACVW), Waikoloa, HI, USA, 3–7 January 2024; pp. 958–979. [Google Scholar] [CrossRef]
  6. Liu, L.; Zhao, Z.; Feng, J.; Xu, F.; Zhang, Y.; Pei, Q.; Xiao, M. Distributed collaborative computing for task completion rate maximization in vehicular edge computing. IEEE Trans. Intell. Transp. Syst. 2025, 26, 18070–18082. [Google Scholar] [CrossRef]
  7. Li, H.; Ota, K.; Dong, M. Learning IoV in 6G: Intelligent edge computing for Internet of Vehicles in 6G wireless communications. IEEE Wirel. Commun. 2023, 30, 96–101. [Google Scholar] [CrossRef]
  8. Yu, W.; Liang, F.; He, X.; Hatcher, W.G.; Lu, C.; Lin, J.; Yang, X. A survey on the edge computing for the Internet of Things. IEEE Access 2018, 6, 6900–6919. [Google Scholar] [CrossRef]
  9. Tao, M.; Liao, L.; Zhang, Y.; Liu, L.; Min, G.; Niyato, D.; Dustdar, S. EDT-SaFL: Semi-asynchronous federated learning for edge digital twin in Industrial Internet-of-Things. IEEE Trans. Mob. Comput. 2026, 25, 674–690. [Google Scholar] [CrossRef]
  10. Jia, C.; Yang, Y.; Xia, Y.; Chen, Y.-T.; Parekh, Z.; Pham, H.; Le, Q.; Sung, Y.-H.; Li, Z.; Duerig, T. Scaling up visual and vision-language representation learning with noisy text supervision. In Proceedings of the 38th International Conference on Machine Learning (ICML 2021), Virtual Event, 18–24 July 2021; Volume 139, pp. 4904–4916. [Google Scholar]
  11. Hu, C.; Li, B. Distributed inference with deep learning models across heterogeneous edge devices. In Proceedings of the IEEE INFOCOM 2022—IEEE Conference on Computer Communications, London, UK, 2–5 May 2022; pp. 330–339. [Google Scholar] [CrossRef]
  12. Liu, Y.; Mao, Y.; Liu, Z.; Ye, F.; Yang, Y. Joint task offloading and resource allocation in heterogeneous edge environments. IEEE Trans. Mob. Comput. 2024, 23, 7318–7334. [Google Scholar] [CrossRef]
  13. He, Y.; Fang, J.; Yu, F.R.; Leung, V.C. Large language models (LLMs) inference offloading and resource allocation in cloud-edge computing: An active inference approach. IEEE Trans. Mob. Comput. 2024, 23, 11253–11264. [Google Scholar] [CrossRef]
  14. Huang, H.; Du, Y.; Zhan, W.; Duan, H.; Peng, K.; Cheng, Y.; Ye, Y.; Zhao, Z. Dynamic model deployment, batch scheduling, and resource allocation in MLLM-enabled edge–cloud networks: A multiagent two-timescale DRL approach. IEEE Internet Things J. 2025, 12, 50818–50835. [Google Scholar] [CrossRef]
  15. Ouyang, T.; Zhao, K.; Zhang, X.; Zhou, Z.; Chen, X. Dynamic edge-centric resource provisioning for online and offline services co-location. In Proceedings of the IEEE INFOCOM 2023—IEEE Conference on Computer Communications, New York, NY, USA, 17–20 May 2023; pp. 1–10. [Google Scholar] [CrossRef]
  16. Chen, Y.; Zhang, S.; Yan, Y.; Jin, Y.; Chen, N.; Ji, M.; Xiao, M. Crowd2: Multi-agent bandit-based dispatch for video analytics upon crowdsourcing. In Proceedings of the IEEE INFOCOM 2023—IEEE Conference on Computer Communications, New York, NY, USA, 17–20 May 2023; pp. 1–10. [Google Scholar] [CrossRef]
  17. Auer, P.; Cesa-Bianchi, N.; Fischer, P. Finite-time analysis of the multiarmed bandit problem. Mach. Learn. 2002, 47, 235–256. [Google Scholar] [CrossRef]
Figure 1. Comparison of different devices for LLM and ENC.
Figure 2. DistMLLM for multimodal large language model service.
Figure 3. Comparison of encoder tensor sizes and latency breakdowns. (a) Output tensor sizes for different visual encoders. (b) Latency components of LLM processing, multimodal encoding, and transmission under Jetson AGX Orin and A100.
Figure 4. Performance comparison of different algorithms for LLM and ENC tasks (error bars and shaded areas indicate the standard deviation over 20 runs).
Figure 5. Averaged ENC and LLM profit of individual providers under the DistMLLM algorithm.
Figure 6. Analysis of MLLM task performance across providers and workers (shaded areas indicate the standard deviation over 20 runs).
Figure 7. Evaluating the effect of disaggregation and tensor transmission cost (shaded areas indicate the standard deviation over 20 runs).
Table 1. Summary of equipment parameters.

| Name | GPU | RAM | Storage | Power |
| --- | --- | --- | --- | --- |
| Jetson Orin Nano | 1024 CUDA cores | 8 GB | microSD/SSD | 5–15 W (configurable) |
| Jetson AGX Orin | 2048 CUDA cores | 64 GB | NVMe SSD | 15–60 W (configurable) |
| RTX 4090 | 16,384 CUDA cores | 24 GB | NVMe SSD | 450 W |
Table 2. List of notations.

| Notation | Description |
| --- | --- |
| $\mathcal{M}$ | Set of MLLM service providers |
| $\mathcal{N}$ | Set of edge workers |
| $\mathcal{T}$ | Set of discrete time slots |
| $x_{m,t}$ | Edge worker selected by service provider $m$ at time slot $t$ |
| $l_{m,n}$ | Latency of edge worker $n$ when chosen by service provider $m$ |
| $e_{m,n}$ | Energy consumption for service provider $m$ when choosing edge worker $n$ |
| $r_{m,n}$ | Reward for service provider $m$ when choosing edge worker $n$ |
| $c_m$ | Computation demand of service provider $m$ in each time slot |
| $C_n$ | Computing capacity of edge worker $n$ in each time slot |
| $d_m$ | Input data size for service provider $m$ |
| $p_n$ | Device performance of edge worker $n$ |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
