AoI-Aware Optimization of Service Caching-Assisted Offloading and Resource Allocation in Edge Cellular Networks

The rapid development of the Internet of Things (IoT) has led to computational offloading at the edge, a promising paradigm for achieving intelligence everywhere. As offloading can lead to more traffic in cellular networks, caching is used to alleviate the channel burden. For example, a deep neural network (DNN)-based inference task requires a computation service that involves running libraries and parameters. Thus, caching the service package is necessary for repeatedly running DNN-based inference tasks. On the other hand, as the DNN parameters are usually trained in a distributed manner, IoT devices need to fetch up-to-date parameters for inference task execution. In this work, we consider the joint optimization of computation offloading, service caching, and the age of information (AoI) metric. We formulate a problem to minimize the weighted sum of the average completion delay, energy consumption, and allocated bandwidth. Then, we propose the AoI-aware service caching-assisted offloading framework (ASCO) to solve it, which consists of the method of Lagrange multipliers with the KKT condition-based offloading module (LMKO), the Lyapunov optimization-based learning and update control module (LLUC), and the Kuhn–Munkres (KM) algorithm-based channel-division fetching module (KCDF). The simulation results demonstrate that our ASCO framework achieves superior performance in terms of time overhead, energy consumption, and allocated bandwidth. The ASCO framework benefits not only the individual task but also the global bandwidth allocation.


Introduction
In recent decades, the Internet of Things (IoT) has experienced rapid development and become ubiquitous in our daily lives. IoT devices have proliferated and evolved with advanced hardware architectures, and are being leveraged to create seamless networks that cover every corner of the globe [1]. Along with the development of IoT devices, a promising computing paradigm known as edge computing has arisen; this involves moving the location of computation from the central network to the network edge [2]. Moving the task execution from the cloud server to the multi-access edge computing (MEC) server (e.g., base station, access point) significantly alleviates the congestion of the core network and relieves the burden on the cloud. Meanwhile, tasks that are computation-intensive, energy-hungry, and subject to real-time requirements have emerged (e.g., deep neural network (DNN)-based automatic license plate recognition). The mobile devices that generate these tasks (e.g., smartphones and unmanned aerial vehicles) are constrained in terms of energy and computational capabilities. Therefore, it is necessary to offload tasks to nearby MEC servers for remote execution [3], which is also known as computation offloading [4].
However, the exponential growth in the volume of offloaded data has led to increased traffic burdens on cellular networks, causing channel congestion. Under unstable network conditions, such as extremely high transmission latency, the performance of computation offloading can drastically decline. A caching policy [5] is proposed to tackle this issue by proactively storing the service in IoT devices, including MEC servers and mobile devices, to reduce the traffic of the cellular network. If an IoT device caches the service libraries and parameters, the task can be directly processed. Hence, the task processing time can be dramatically reduced [6]. A DNN-based task is executed by a corresponding service package, consisting of the required libraries and network parameters. Since the MEC server and mobile devices process distinct types of tasks, it is impractical to proactively cache all types of services due to storage limits. Instead, they cache a service only when a task requires it, and the cache is retained for a limited time horizon.
Machine learning plays a significant role in the wireless network [7]. Considering a distributed machine learning scenario [8], the DNN is trained in a distributed manner. Then, the trained parameters of the DNN are assembled on an application server. The application server gathers all of the trained parameters and further trains a global DNN. Since the new data are generated from mobile devices, the trained parameters are updated ceaselessly and the global DNN is retrained based on the newly gathered parameters at the end of every global training round. Thus, the global DNN always reflects the up-to-date trained parameters. However, mobile devices may not fetch the latest parameters in every round. Hence, the cached DNN model may be outdated, which should be updated to keep the model fresh. To measure the freshness of the global service parameters at the MEC servers and the mobile devices, we introduce the concept of AoI [9], which is defined as the elapsed time since the generation of the latest received global service parameters response. The global service parameters are generated by training at the end of every global training round. When the MEC server or mobile device is required to execute inference tasks, it first checks whether fresh service parameters exist. If the service parameters are stale, the MEC server or mobile device needs to request the application server to fetch the up-to-date trained parameters for inference task execution.

Challenges
To realize distributed machine learning and service caching, the following challenges should be addressed:

Cost of the Task
On the one hand, the inference task completion time needs to be less than its corresponding maximal tolerance deadline. Thus, minimizing the inference task completion time is necessary for real-time requirements. On the other hand, the inference tasks are generated on energy-constrained mobile devices, which carefully make the offloading decisions to minimize energy consumption. Therefore, it is challenging to minimize the cost of the inference task consisting of time delay and energy consumed.

Bandwidth Consumption of the Application Server
If IoT devices fetch the latest service parameters from the application server, they consume the limited wireless bandwidth of the cellular network. Therefore, there is a trade-off between the fetching time and the total available bandwidth. If the application preferentially guarantees the fetching time, the remaining bandwidth is not enough to serve other applications, and vice versa. Thus, the trade-off between time and bandwidth needs to be addressed.

Matching between Wireless Channels and IoT Devices
When bandwidth is limited, the matching between wireless channels and IoT devices is significant for minimizing the fetching time, since an IoT device may experience different channel fading and co-channel interference on different wireless channels. Hence, the third challenge is to match wireless channels with IoT devices to further minimize the service fetching time.

Related Work

Offloading with Cache
Some works make offloading decisions by considering the cache technology. In [10], an algorithm was devised by taking into account the multi-cast opportunity with cache in a multi-user scenario. A computing offloading and content caching model was proposed to reduce the time delay in the Internet of Vehicles in [11]. In [12], an optimal computing offloading and caching policy was designed to minimize the latency in a hybrid mobile system. In [13], an approximation collaborative computation offloading scheme and a game-theoretic collaborative computation offloading scheme were devised to achieve better offloading performance and scale well with the increasing number of computation tasks. The above works do not consider the age of the cache, which may degrade the quality of service (QoS).

Cache of Data
In terms of data caching, existing works focus on frequently reused data to improve performance. In [14], a deep supervised learning method was adopted to make real-time decisions in a dynamic vehicle network. An online caching placement and prediction-based data pre-fetch method were designed in [15] to address the uncertainty of future task parameters. In [16], a cache deployment strategy in a large-scale Wi-Fi system was adopted to maximize the caching benefit and achieve better caching performance. In [17], a joint power allocation-caching problem was formulated to maximize the downlink performance in the caching FiWi network. However, these works do not take into account the caching of the service, which is crucial in the DNN-based task.

Cache of Service
With respect to service caching, a few works consider caching services to enhance system efficiency. In [18], an online caching algorithm was proposed to minimize the overall computation delay. An extremely compelling (but much less studied) problem was studied in MEC-enabled dense cellular networks in [19]. In [20], an online service caching algorithm was devised to achieve the optimal worst-case competitive ratio under homogeneous task arrivals. In [21], a cache placement algorithm was adopted to minimize the data traffic forwarded to the remote cloud. The above-mentioned works only studied the cache and did not combine it with offloading.

Age of Information
In regard to the age of information, some works focused on minimizing the AoI of the optimized goal. In [9], the concept of AoI was first proposed, and general methods were derived to calculate the age metric, which can be applied to broad types of service systems. Dynamic cache content update scheduling algorithms were designed to minimize the average AoI of the dynamic content delivered to the users in [22]. In [23], a dueling deep R-network-based status updating algorithm was proposed by combining the dueling deep Q-network and R-learning to minimize the average cost. In [24], an algorithm aimed to obtain an optimal trade-off between age and latency was adopted for the freshness-aware buffer update in a mobile edge scenario. However, these works did not leverage the AoI metric to improve the offloading performance in an edge system.

Contribution
In this paper, we consider an AoI-aware service caching-assisted offloading scenario. Our objective is to minimize the weighted sum of the average completion delay, energy consumption, and allocated bandwidth. We decompose the original problem into three subproblems: minimizing the average time overhead cost and energy consumption of inference tasks, minimizing the required average bandwidth, and minimizing the fetching time of responding IoT devices. To solve the subproblems, we propose the AoI-aware service caching-assisted offloading framework (ASCO), which consists of three modules: the method of Lagrange multipliers with the KKT condition-based offloading module (LMKO), the Lyapunov optimization-based learning and update control module (LLUC), and the Kuhn-Munkres (KM) algorithm-based channel-division fetching module (KCDF). Simulation results show that our ASCO framework achieves superior performance compared to other baseline combinations in terms of time overhead, energy consumption, and allocated bandwidth. The main contributions of the paper are summarized as follows:
1. To minimize the average time overhead cost and energy consumption of inference tasks, we transform the problem into a Lagrangian dual problem. Then, we propose the LMKO module based on the method of Lagrange multipliers with Karush-Kuhn-Tucker (KKT) conditions to make an optimal offloading decision.
2. To minimize the required average bandwidth, we transform the problem into a Lyapunov drift-plus-penalty problem by minimizing the total required bandwidth while keeping the requesting data queue backlog stable. Further, we propose the LLUC module based on Lyapunov optimization to derive an optimal dequeue rate.
3. To minimize the fetching time of IoT devices, we consider the problem of finding the perfect matching by maximizing the sum of the link weights in the equality subgraph. Moreover, we propose the KCDF module based on the KM algorithm to obtain the optimal matching decision.
The novelty of the paper consists of three aspects. First, we propose an AoI-aware service caching-assisted offloading scenario, which has not been considered in the literature. This scenario takes into account service caching in distributed machine learning, including the service libraries and parameters, which is a widely used technology and worth investigating. Second, we consider the freshness of the service cache for computation offloading. Existing works omit the AoI of the cache, especially the service cache, which degrades the offloading performance. We aim to minimize the costs from both the mobile device side and the global perspective. Third, we propose the novel ASCO framework, which comprises three modules and outperforms the existing baselines in our simulations.
The rest of the paper is organized as follows. We elaborate on the system model in Section 2. An analysis of the formulated problems is detailed in Section 3. Section 4 presents the proposed solution. The evaluation simulation is described in Section 5, followed by the conclusion in Section 6.

System Model
We consider an AoI-aware service caching asymmetric network consisting of heterogeneous mobile devices and MEC servers, as shown in Figure 1. A set of $|\mathcal{N}|$ mobile devices (e.g., smartphones and intelligent vehicles) indexed by $n$ is denoted as $\mathcal{N} = \{1, \dots, |\mathcal{N}|\}$. A set of $|\mathcal{M}|$ MEC servers (e.g., access points and base stations) indexed by $m$ is denoted as $\mathcal{M} = \{1, \dots, |\mathcal{M}|\}$. Since an AI-based inference task generated from the mobile device is computation-intensive and has real-time requirements, the mobile device with constrained computation capability needs to offload the inference task to the MEC server with sufficient computation resources. On the one hand, the inference task is processed by the corresponding service. For instance, an image recognition inference task is inferred by a DNN service running on service libraries, e.g., machine learning frameworks. On the other hand, caching, which includes content caching and service caching, can alleviate the transmission traffic during offloading.
Considering a distributed machine learning scenario, an application server periodically trains an up-to-date DNN and then distributes it to the MEC servers and mobile devices. The inference task data are hardly reusable, while the DNN service is frequently reusable. Thus, unlike caching the inference task data, caching the DNN service significantly reduces the transmission time. Note that the DNN service consists of the service libraries and service parameters. Since a DNN with the latest parameters achieves better inference accuracy thanks to the periodical training, the service parameters should be updated whenever a new version is available. The service libraries are static and are only transmitted once for caching, while the service parameters are dynamic. We define the AoI as the elapsed time since the generation of the latest received service parameters at MEC servers or mobile devices, which measures the freshness of service parameters. If the AoI of service parameters is less than the periodical training round, then the parameters are considered to be the latest version and fresh enough to be used for inference. Otherwise, since a new version has been generated at the application server, the parameters are stale and need to be updated to the latest version. Note that mobile devices do not hold the AoI information of the MEC servers due to privacy concerns and transmission overhead.

Task Model
Considering a time-slotted system, a set of $|\mathcal{T}|$ timeslots indexed by $t$ is denoted as $\mathcal{T} = \{1, \dots, |\mathcal{T}|\}$. The inference task generated from the mobile device $n$ at timeslot $t$ is denoted as $k_n(t)$. A set of $|\mathcal{J}|$ service types indexed by $j$ is denoted as $\mathcal{J} = \{1, \dots, |\mathcal{J}|\}$. For instance, plate image recognition and face recognition are distinct types of services. Each inference task has a corresponding service type; the relationship is represented as follows: $x^{\mathrm{typ}}_{k_n(t),j} = 1$ if $k_n(t)$ is of type $j$; otherwise, $x^{\mathrm{typ}}_{k_n(t),j} = 0$. This satisfies the condition that each inference task is executed by a service of exactly one type: $\sum_{j \in \mathcal{J}} x^{\mathrm{typ}}_{k_n(t),j} = 1$. The inference tasks are computation-intensive and have a maximum time tolerance. We define the inference task profile of $k_n(t)$ as $(d_{k_n(t)}, c_{k_n(t)}, T^{\max}_{k_n(t)})$, where $d_{k_n(t)}$ is the inference task input size, $c_{k_n(t)}$ is the task computation amount, and $T^{\max}_{k_n(t)}$ is the task maximum completion tolerance deadline. Taking image recognition as an example, $d_{k_n(t)}$ is the image size in bits, $c_{k_n(t)}$ represents the CPU cycles required by the DNN service, and $T^{\max}_{k_n(t)}$ is the image recognition deadline, meaning that the inference task processing delay cannot exceed the tolerance time.

Communication Model
The application server, mobile devices, and MEC servers mutually communicate over the cellular network. Based on the Shannon theory, the transmission rate between the mobile device $n$ and MEC server $m$ can be expressed as
$$r_{n,m}(t) = b_{n,m}(t) \log_2\left(1 + \frac{p_{n,m}(t)\, h_{n,m}(t)}{\sigma^2 + I_{n,m}(t)}\right), \tag{1}$$
where $b_{n,m}(t)$ is the allocated bandwidth, $p_{n,m}(t)$ is the transmission power from $n$ to $m$, $h_{n,m}(t)$ is the channel gain, $\sigma^2$ represents the power of the additive white Gaussian noise, and $I_{n,m}(t)$ is the co-channel interference that mobile device $n$ suffers on the cellular channel. The transmission power affects the achievable spectral efficiency, and a larger allocated bandwidth leads to a higher transmission rate. The channel gain of each link varies over time due to mobility. Since mobile devices are energy-constrained, the transmission power has an upper bound: $p_{n,m}(t) \le p^{\max}$, where $p^{\max}$ is the maximum transmission power. Moreover, the uploading time of inference task $k_n(t)$ from the mobile device $n$ to the MEC server $m$ can be calculated as $T^{\mathrm{upl}}_{n,m,k_n(t)} = x^{\mathrm{exe}}_{m,k_n(t)}(t)\, d_{k_n(t)} / r_{n,m}(t)$, where $x^{\mathrm{exe}}_{m,k_n(t)}(t)$ is the offloading decision, defined as $x^{\mathrm{exe}}_{m,k_n(t)}(t) = 1$ if $k_n(t)$ is offloaded to $m$; otherwise, $x^{\mathrm{exe}}_{m,k_n(t)}(t) = 0$. It satisfies $\sum_{m \in \mathcal{M}} x^{\mathrm{exe}}_{m,k_n(t)}(t) \le 1$, meaning that each inference task is offloaded to one MEC server at most.
The transmission rate from the application server to the MEC server, $r_{0,m}(t)$, and the transmission rate from the application server to the mobile device, $r_{0,n}(t)$, can be calculated similarly to (1).
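The communication model can be illustrated with a minimal Python sketch. It assumes the Shannon-type rate described above and an uploading time equal to the input size divided by the rate; the function names and all numeric values are illustrative assumptions, not taken from the paper's simulation setup.

```python
import math


def transmission_rate(bandwidth_hz, tx_power_w, channel_gain, noise_w, interference_w):
    """Shannon-type rate b * log2(1 + p*h / (sigma^2 + I)) in bit/s."""
    sinr = tx_power_w * channel_gain / (noise_w + interference_w)
    return bandwidth_hz * math.log2(1.0 + sinr)


def upload_time(input_bits, rate_bps, offloaded=True):
    """Uploading time d / r of an offloaded task; zero if the task stays local."""
    if not offloaded:
        return 0.0
    if rate_bps <= 0:
        raise ValueError("an offloaded task needs a positive transmission rate")
    return input_bits / rate_bps


# Illustrative values only (assumed): 1 MHz bandwidth, 0.1 W transmit power.
rate = transmission_rate(bandwidth_hz=1e6, tx_power_w=0.1, channel_gain=1e-6,
                         noise_w=1e-9, interference_w=9e-9)
delay = upload_time(input_bits=2e6, rate_bps=rate)
print(f"rate = {rate:.3e} bit/s, upload time = {delay:.3f} s")
```

With these assumed numbers the signal-to-interference-plus-noise ratio is about 10, so doubling the bandwidth halves the uploading time, whereas doubling the power only changes the logarithmic term.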

Caching Model
To alleviate the transmission traffic, the MEC servers and mobile devices cache the DNN service in their caching storage if they have no corresponding cache. Let $d^{\mathrm{lib}}_j$ be the library size and $d^{\mathrm{par}}_j$ be the parameter size of service type $j$, respectively.
Therefore, the fetching time $T^{\mathrm{cac}}_{0,m,j}$ for the DNN service of type $j$ to the MEC server $m$ is composed of $T^{\mathrm{lib}}_{0,m,j}$ and $T^{\mathrm{par}}_{0,m,j}$, the fetching times of the service libraries and parameters at the MEC server, respectively, and depends on $x^{\mathrm{cac}}_{m,j}(t)$, the service caching placement decision at the MEC server $m$, defined as $x^{\mathrm{cac}}_{m,j}(t) = 1$ if $j$ is cached in $m$; otherwise, $x^{\mathrm{cac}}_{m,j}(t) = 0$. Similarly, the fetching time for the DNN service of type $j$ to mobile device $n$ can be represented as $T^{\mathrm{cac}}_{0,n,j}$. Due to the limited caching capacity of the MEC server, there is a constraint on the storage cache: $\sum_{j \in \mathcal{J}} x^{\mathrm{cac}}_{m,j}(t)\,(d^{\mathrm{lib}}_j + d^{\mathrm{par}}_j) \le d^{\max}_m$, i.e., the total DNN service size of all cached types cannot exceed the storage upper bound $d^{\max}_m$. The total DNN service size in mobile devices has a similar constraint.
With fresh parameters, the DNN inference task achieves satisfactory performance. To measure the freshness of the DNN service parameters, we introduce the concept of AoI and quantify the age at the MEC server $m$ by $\Delta_{m,j}(t)$, measured relative to $t_j$, the timeslot of the latest periodical training of service type $j$. The AoI $\Delta_{n,j}(t)$ at mobile devices is calculated in the same way. The updating mechanism at mobile device $n$ can be defined as
$$\Delta_{n,j}(t) = \begin{cases} \dfrac{d^{\mathrm{par}}_j}{r_{0,n}(t)}, & \text{if fetching ends at timeslot } t, \\ \Delta_{n,j}(t-1) + 1, & \text{otherwise,} \end{cases}$$
and the MEC server $m$ follows the same mechanism with $r_{0,m}(t)$. Since the DNN is trained periodically in the application server, the DNN training round of type $j$ can be denoted as $T^{\mathrm{int}}_j$. If the AoI of the parameters is less than the training round, the parameters can be regarded as fresh parameters. Let $x^{\mathrm{fre}}_{m,j}(t)$ be the service parameter freshness status, defined as follows: $x^{\mathrm{fre}}_{m,j}(t) = 1$ if $\Delta_{m,j}(t) < T^{\mathrm{int}}_j$; otherwise, $x^{\mathrm{fre}}_{m,j}(t) = 0$. The status $x^{\mathrm{fre}}_{n,j}(t)$ at mobile device $n$ is defined in the same way. If $x^{\mathrm{fre}}_{m,j}(t) = 0$ or $x^{\mathrm{fre}}_{n,j}(t) = 0$, the MEC server or the mobile device is required to fetch an up-to-date version of the service parameters from the application server; the fetching times are $T^{\mathrm{par}}_{0,m,j}$ and $T^{\mathrm{par}}_{0,n,j}$, respectively.
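The AoI update and the freshness test can be traced with a small Python sketch. It assumes the rule stated above (the age is reset to the fetch duration when a fetch ends and grows by one slot otherwise) and measures the rate in bits per timeslot so that the fetch duration is expressed in slots; the function names and numbers are hypothetical.

```python
def next_aoi(prev_aoi, fetch_ended, param_bits=0.0, rate_bits_per_slot=1.0):
    """One-slot AoI update: reset to the fetch duration, or age by one slot."""
    if fetch_ended:
        return param_bits / rate_bits_per_slot
    return prev_aoi + 1


def is_fresh(aoi, training_interval):
    """Parameters count as fresh while their AoI is below the training round."""
    return aoi < training_interval


# Illustrative trace (assumed values): training round of 3 slots, and a
# parameter fetch of 4e6 bits at 8e6 bits per slot that ends in the 4th slot.
TRAINING_INTERVAL = 3
aoi = 0.5  # age right after an earlier fetch that took half a slot
trace = []
for fetch_ended in (False, False, False, True):
    aoi = next_aoi(aoi, fetch_ended, param_bits=4e6, rate_bits_per_slot=8e6)
    trace.append((aoi, is_fresh(aoi, TRAINING_INTERVAL)))
print(trace)
```

In this trace the cached parameters stay fresh for two slots, become stale in the third, and are fresh again once the fetch completes.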

Execution Model
In terms of execution, an inference task can only be executed when the corresponding service is available. If no DNN service is cached at the MEC server or mobile device, it must first fetch the DNN service and then carry out the execution. After the service is fetched, the execution delay of inference task $k_n(t)$ at the MEC server $m$ is calculated as $T^{\mathrm{exe}}_{m,k_n(t)}(t) = c_{k_n(t)} / f_m(t)$, where $f_m(t)$ is the computation capability of the MEC server $m$. Likewise, the execution delay at the mobile device $n$ is calculated as $T^{\mathrm{exe}}_{n,k_n(t)} = c_{k_n(t)} / f_n(t)$, where $f_n(t)$ is the constant computation capability of the mobile device $n$. Here, the computation capability of a mobile device is less than that of a MEC server, and $f_m(t)$ has an upper bound: $f_n(t) < f_m(t) \le f^{\max}$, where $f^{\max}$ is the maximum computation capability.

Energy Model
From the perspective of energy consumption, we focus on the energy of mobile devices since they usually have batteries of limited capacity, while the MEC server is connected to the power grid. Hence, the energy consumed for the local execution of the mobile device $n$ can be calculated as $E^{\mathrm{exe}}_{n,k_n(t)} = \mu\, c_{k_n(t)}\, f_n(t)^2$, where $\mu$ refers to the effective switched capacitance.
In the case of offloading, the energy consumption of the mobile device only includes the uploading energy, calculated as $E^{\mathrm{upl}}_{n,m,k_n(t)} = p_{n,m}(t)\, T^{\mathrm{upl}}_{n,m,k_n(t)}$. Energy consumption is another crucial metric of mobile devices. The cost of the mobile device consists of the time delay and the energy consumption, weighted with distinct emphasis.
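The execution and energy models can be sketched numerically as follows. The sketch assumes the commonly used forms in which the execution delay is the required CPU cycles divided by the CPU frequency, the local execution energy is the effective switched capacitance times the cycles times the squared frequency, and the uploading energy is the transmission power times the uploading time. Function names and all values are illustrative assumptions.

```python
def execution_delay(cycles, cpu_hz):
    """Execution delay c / f in seconds."""
    if cpu_hz <= 0:
        raise ValueError("CPU frequency must be positive")
    return cycles / cpu_hz


def local_energy(cycles, cpu_hz, mu):
    """Local execution energy mu * c * f^2 in joules (assumed dynamic-power model)."""
    return mu * cycles * cpu_hz ** 2


def upload_energy(tx_power_w, upload_time_s):
    """Energy spent by the device while uploading: p * T_upl."""
    return tx_power_w * upload_time_s


# Illustrative values (assumed): a task of 1e9 cycles, a 1 GHz device, a 5 GHz MEC server.
CYCLES = 1e9
t_local = execution_delay(CYCLES, 1e9)
t_mec = execution_delay(CYCLES, 5e9)
e_local = local_energy(CYCLES, 1e9, mu=1e-27)
e_upload = upload_energy(tx_power_w=0.1, upload_time_s=0.5)
print(t_local, t_mec, e_local, e_upload)
```

Under these assumptions, offloading trades a device-side energy of about 1 J for an uploading energy of about 0.05 J, at the price of the uploading time and any service fetching time.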

Cost Model
At timeslot t, the mobile device n with the generated DNN inference task k n (t) can make an offloading decision to process the task. According to the service caching placement decision and service parameter freshness status, the cost of the mobile device can be divided into the following cases, as seen in Figure 1.

Case 1: Offloading with Fresh Cache
First, in case 1, the mobile device offloads the inference task to the MEC server with cached service libraries and fresh parameters. The combination of the decision and status satisfies $x_{k_n(t),1}(t) = x^{\mathrm{exe}}_{m,k_n(t)}(t)\, x^{\mathrm{typ}}_{k_n(t),j}\, x^{\mathrm{cac}}_{m,j}(t)\, x^{\mathrm{fre}}_{m,j}(t) = 1$. The total time delay, in this case, can be calculated as $T_{k_n(t),1} = T^{\mathrm{upl}}_{n,m,k_n(t)} + T^{\mathrm{exe}}_{m,k_n(t)}$. In addition, the total energy consumption of the mobile device is represented as $E_{k_n(t),1} = E^{\mathrm{upl}}_{n,m,k_n(t)}$.

Case 2: Offloading with Stale Cache
In case 2, the mobile device offloads the inference task to the MEC server with cached service libraries and stale parameters. The combination of the decision and status satisfies $x_{k_n(t),2}(t) = x^{\mathrm{exe}}_{m,k_n(t)}(t)\, x^{\mathrm{typ}}_{k_n(t),j}\, x^{\mathrm{cac}}_{m,j}(t)\, (1 - x^{\mathrm{fre}}_{m,j}(t)) = 1$. The total time delay, in this case, can be calculated as $T_{k_n(t),2} = T^{\mathrm{upl}}_{n,m,k_n(t)} + T^{\mathrm{par}}_{0,m,j} + T^{\mathrm{exe}}_{m,k_n(t)}$. Moreover, the total energy consumption of the mobile device is denoted as $E_{k_n(t),2} = E^{\mathrm{upl}}_{n,m,k_n(t)}$.

Case 3: Offloading without Cache
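Then, in case 3, the mobile device offloads the inference task to the MEC server without any DNN service cache. The combination of the decision and status satisfies $x_{k_n(t),3}(t) = x^{\mathrm{exe}}_{m,k_n(t)}(t)\, x^{\mathrm{typ}}_{k_n(t),j}\, (1 - x^{\mathrm{cac}}_{m,j}(t)) = 1$. The total time delay, in this case, can be calculated as follows: $T_{k_n(t),3} = T^{\mathrm{upl}}_{n,m,k_n(t)} + T^{\mathrm{lib}}_{0,m,j} + T^{\mathrm{par}}_{0,m,j} + T^{\mathrm{exe}}_{m,k_n(t)}$. Likewise, the total energy consumption of the mobile device is also represented as $E_{k_n(t),3} = E^{\mathrm{upl}}_{n,m,k_n(t)}$.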

Case 4: Local Execution with Fresh Cache
For local execution, in case 4, the mobile device locally executes the inference task with cached service libraries and fresh parameters. The combination of the decision and status satisfies $x_{k_n(t),4}(t) = (1 - \sum_{m \in \mathcal{M}} x^{\mathrm{exe}}_{m,k_n(t)}(t))\, x^{\mathrm{typ}}_{k_n(t),j}\, x^{\mathrm{cac}}_{n,j}(t)\, x^{\mathrm{fre}}_{n,j}(t) = 1$. The total time delay, in this case, can be calculated as follows: $T_{k_n(t),4} = T^{\mathrm{exe}}_{n,k_n(t)}$. In addition, the total energy consumption of the mobile device is denoted as $E_{k_n(t),4} = E^{\mathrm{exe}}_{n,k_n(t)}$.

Case 5: Local Execution with Stale Cache
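In case 5, the mobile device locally executes the inference task with cached service libraries and stale parameters. The combination of the decision and status satisfies $x_{k_n(t),5}(t) = (1 - \sum_{m \in \mathcal{M}} x^{\mathrm{exe}}_{m,k_n(t)}(t))\, x^{\mathrm{typ}}_{k_n(t),j}\, x^{\mathrm{cac}}_{n,j}(t)\, (1 - x^{\mathrm{fre}}_{n,j}(t)) = 1$. The total time delay, in this case, can be calculated as $T_{k_n(t),5} = T^{\mathrm{par}}_{0,n,j} + T^{\mathrm{exe}}_{n,k_n(t)}$. Then, the total energy consumption of the mobile device is calculated as $E_{k_n(t),5} = E^{\mathrm{exe}}_{n,k_n(t)}$.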

Case 6: Local Execution without Cache
Finally, in case 6, the mobile device locally executes the inference task without any DNN service cache. The combination of the decision and status satisfies $x_{k_n(t),6}(t) = (1 - \sum_{m \in \mathcal{M}} x^{\mathrm{exe}}_{m,k_n(t)}(t))\, x^{\mathrm{typ}}_{k_n(t),j}\, (1 - x^{\mathrm{cac}}_{n,j}(t)) = 1$. The total time delay, in this case, can be calculated as $T_{k_n(t),6} = T^{\mathrm{lib}}_{0,n,j} + T^{\mathrm{par}}_{0,n,j} + T^{\mathrm{exe}}_{n,k_n(t)}$. Similarly, the total energy consumption of the mobile device is also denoted as $E_{k_n(t),6} = E^{\mathrm{exe}}_{n,k_n(t)}$.
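The six cases can be summarized by a small Python sketch that maps the offloading decision and the cache state of the executing node to a case index, a total delay, and a device-side energy. It assumes that a missing cache requires fetching both libraries and parameters, and that a stale cache requires fetching parameters only; the names `CacheState` and `case_cost` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class CacheState:
    cached: bool  # service of the task's type is cached at the executing node
    fresh: bool   # cached parameters are fresh (ignored when nothing is cached)


def case_cost(offload, state, t_upl, t_lib, t_par, t_exe, e_upl, e_exe):
    """Return (case index, total delay, device energy) for one inference task.

    t_lib, t_par and t_exe refer to the executing node: the chosen MEC server
    when offload is True, the mobile device itself otherwise.
    """
    if not state.cached:
        fetch, offset = t_lib + t_par, 3   # cases 3 and 6: fetch everything
    elif not state.fresh:
        fetch, offset = t_par, 2           # cases 2 and 5: refresh parameters
    else:
        fetch, offset = 0.0, 1             # cases 1 and 4: nothing to fetch
    if offload:
        return offset, t_upl + fetch + t_exe, e_upl
    return offset + 3, fetch + t_exe, e_exe


# Illustrative timings in seconds and energies in joules (assumed values).
timing = dict(t_upl=0.5, t_lib=2.0, t_par=1.0, t_exe=0.25, e_upl=0.05, e_exe=1.0)
print(case_cost(True, CacheState(cached=False, fresh=False), **timing))
print(case_cost(False, CacheState(cached=True, fresh=False), **timing))
```

The first call corresponds to case 3 (offloading without cache) and the second to case 5 (local execution with stale cache).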

Problem Formulation
In the AoI-aware caching-assisted asymmetric offloading scenario, both the average cost of the mobile devices and the total bandwidth between the application server and the MEC servers or mobile devices must be considered. On the one hand, minimizing the average cost of the mobile device ensures that the real-time requirements of the generated inference tasks are met and the battery energy is conserved. On the other hand, as the bandwidth of the application server is limited and it also serves other real-time inference tasks, the total bandwidth consumption between the application server and the MEC servers or mobile devices needs to be minimized. Accordingly, the average cost of time completion delay is obtained by averaging the delay $T_{k_n(t),i}$ of the selected case over all mobile devices and timeslots, where $x_{k_n(t),i}(t) = 1$ if inference task $k_n(t)$ is executed via case $i$; otherwise, $x_{k_n(t),i}(t) = 0$. The case indicators satisfy the condition that each inference task must be executed via one of the cases at one MEC server or the local mobile device. The average cost of energy consumption is defined analogously with $E_{k_n(t),i}$, and the time average global allocation bandwidth between the application server and MEC servers or mobile devices is denoted as $b_0$. Therefore, we formally formulate the original problem (12) to minimize the weighted sum of the time average global allocation bandwidth and the average cost of the mobile device, the latter consisting of inference task completion delay and energy consumption, subject to constraints (13)-(22), where $x_{k_n(t),i}(t)$, $f_m(t)$, and $p_{n,m}(t)$ are optimization variables. $\xi^{\mathrm{ban}}$, $\xi^{\mathrm{tim}}$, and $\xi^{\mathrm{ene}}$ are the given weights of the average global allocation bandwidth, average time cost, and average energy cost, respectively. (13) and (14) indicate that the inference task completion time delay and consumed energy have upper bounds. According to (15), each inference task has to be executed via (at most) one case at one MEC server or local mobile device. (16)-(18) show that the optimization variables are binary. (19) and (20) constrain the caching capacity limit of heterogeneous services at the MEC server or the mobile device. (21) shows that the computation capability of the MEC server is higher than that of the mobile device and has a maximum. (22) restricts the upper bound of the uplink transmission power of the mobile device.
$x^{\mathrm{exe}}_{m,k_n(t)}(t)$, $x^{\mathrm{cac}}_{m,j}(t)$, and $x^{\mathrm{fre}}_{m,j}(t)$ are binary integer variables; $p_{n,m}(t)$ and $f_m(t)$ are continuous variables. The objective function is nonlinear in the variables, which are mutually coupled. Therefore, problem (12) is a mixed-integer nonlinear programming (MINLP) problem, which is known to be NP-hard, and it is difficult to solve within polynomial time. The practical asymmetric environment makes it even more challenging to analyze the problem and propose a solution.

Average Cost Minimization Problem
From the perspective of mobile devices, we first decompose problem (12) into problem (23), which minimizes the average cost of mobile devices, where mobile devices make their decisions based on the weighted sum of time delay and energy consumption.

Bandwidth Consumption Minimization Problem
From the perspective of the application server, the total bandwidth allocated to the requesting MEC servers or mobile devices is constrained when it transmits the requested service data. Secondly, we decompose problem (12) into problem (24), which minimizes the consumed bandwidth of the application server, where $b^{\max}_0$ is the upper bound of the total bandwidth allocated by the application server in one timeslot.

Service Fetching Time Minimization Problem
When the application server transmits the service data to the MEC servers or mobile devices, the total transmission time of the responding service data can be minimized based on the total allocated bandwidth. We further formulate problem (26), which minimizes $T^{\mathrm{fet}}(t)$, the total transmission time of the responding service data, for the inference tasks that require fetching, i.e., those with $x_{k_n(t),2}(t) + x_{k_n(t),3}(t) + x_{k_n(t),5}(t) + x_{k_n(t),6}(t) = 1$. (27) indicates that the total allocated bandwidth has an upper bound, and (28) and (29) limit the combination decisions.
Here, we clarify the connections among these three subproblems and how they can work together to reach the optimal solution for problem (12). Problem (12) jointly minimizes the cost of mobile devices and the global allocation bandwidth. Firstly, problem (23) minimizes the mobile device cost, including the time delay and energy consumption. Secondly, problem (24) minimizes the time average allocation bandwidth from a global perspective. Thirdly, problem (26) further minimizes the responding service transmission time after making the offloading decision based on the solution of the problem (23).

Solution
In this section, we propose three modules to, respectively, solve the subproblems in the last section. In particular, the LMKO module can minimize the average cost of mobile devices. To minimize the consumed bandwidth of the application server, we devise the LLUC module. Moreover, the KCDF module minimizes the total transmission time of the responding service data.

Method of Lagrange Multipliers with the KKT Condition-Based Offloading Module (LMKO)
To minimize the average cost of mobile devices, we transform problem (23) into a tractable form and then leverage convex optimization to solve it. According to constraint (13), the time delay of each case cannot exceed the inference task completion tolerance deadline. Hence, to reduce the number of optimization variables, we set the time delay of the case involving the most procedures to the maximum tolerance time: $T_{k_n(t),3} = T^{\max}_{k_n(t)}$. For succinct expression, we define the notations $A$ and $B$ in (30) and (31). Then, $f_m(t)$ and $p_{n,m}(t)$ can be transformed into the function values $g_1(T^{\mathrm{exe}}_{m,k_n(t)}(t))$ and $g_2(T^{\mathrm{exe}}_{m,k_n(t)}(t))$ given in (32) and (33), respectively. Moreover, the cost of the objective function in problem (23) can be calculated as in (34). Therefore, problem (23) can be further transformed into problem (35), where $T^{\mathrm{exe}}_{m,k_n(t)}(t)$ and $x_{k_n(t),i}(t)$ are optimization variables. Constraint (36) relates the computation capability limit to $T^{\mathrm{exe}}_{m,k_n(t)}(t)$. (37) constrains the relationship between the maximum power and $T^{\mathrm{exe}}_{m,k_n(t)}(t)$. (38) indicates that the decision combination is relaxed to be continuous.
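As a worked reading of this substitution (a sketch under stated assumptions, not a reproduction of the paper's equations (30)-(33)), suppose that the execution delay is the computation amount divided by the computation capability, that the rate follows (1), and that the delay of case 3 is the sum of the uploading, library fetching, parameter fetching, and execution times. Fixing the case 3 delay to the deadline then determines the uploading time from the execution time, and inverting the rate expression yields the required power. The grouping of the constants is our own and need not coincide with the paper's notations $A$ and $B$.

```latex
\begin{align}
f_m(t) &= g_1\bigl(T^{\mathrm{exe}}_{m,k_n(t)}(t)\bigr)
        = \frac{c_{k_n(t)}}{T^{\mathrm{exe}}_{m,k_n(t)}(t)}, \\
T^{\mathrm{upl}}_{n,m,k_n(t)} &= T^{\max}_{k_n(t)} - T^{\mathrm{lib}}_{0,m,j}
        - T^{\mathrm{par}}_{0,m,j} - T^{\mathrm{exe}}_{m,k_n(t)}(t), \\
p_{n,m}(t) &= g_2\bigl(T^{\mathrm{exe}}_{m,k_n(t)}(t)\bigr)
        = \frac{\sigma^2 + I_{n,m}(t)}{h_{n,m}(t)}
          \left( 2^{\frac{d_{k_n(t)}}{b_{n,m}(t)\, T^{\mathrm{upl}}_{n,m,k_n(t)}}} - 1 \right).
\end{align}
```

Under this reading, a shorter execution time demands a higher computation capability but leaves more time for uploading and therefore a lower transmission power, which is the trade-off that the single variable $T^{\mathrm{exe}}_{m,k_n(t)}(t)$ captures.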
Subsequently, we leverage the method of Lagrange multipliers with KKT conditions [25] to solve problem (35). Before this, we prove that problem (35) is convex. We define another function $g_3(T^{\mathrm{exe}}_{m,k_n(t)}(t))$ in (39) and take its second partial derivative with respect to $T^{\mathrm{exe}}_{m,k_n(t)}(t)$ in (40). All of the terms in (40) are positive, so $\partial^2 g_3 / \partial (T^{\mathrm{exe}}_{m,k_n(t)}(t))^2 > 0$ and $g_3(T^{\mathrm{exe}}_{m,k_n(t)}(t))$ is a convex function. Similarly, $g_3(x_{k_n(t),i}(t)\, T^{\mathrm{exe}}_{m,k_n(t)}(t))$ is a convex function. Furthermore, we define the perspective function $g_4(x_{k_n(t),i}(t)\, T^{\mathrm{exe}}_{m,k_n(t)}(t), x_{k_n(t),i}(t))$ of $g_3$ in (41). Since the perspective operation preserves convexity, $g_4$ is convex, so the objective function of problem (35) is a convex function. Then, the second partial derivatives of $g_1(T^{\mathrm{exe}}_{m,k_n(t)}(t))$ and $g_2(T^{\mathrm{exe}}_{m,k_n(t)}(t))$ with respect to $T^{\mathrm{exe}}_{m,k_n(t)}(t)$ are calculated in (42) and (43), respectively. All terms in (42) and (43) are positive, so both second derivatives are positive. Thus, constraints (36) and (37) are convex in $T^{\mathrm{exe}}_{m,k_n(t)}(t)$, and the feasible region of problem (35) is a convex set. We can conclude that problem (35) is convex. In addition, if $p^{\max}$ and $f^{\max}$ are high enough, we can find a feasible solution that satisfies all of the constraints strictly, hence satisfying the Slater condition. For a convex problem, the Slater condition is sufficient for strong duality. In other words, the problem and its dual have zero duality gap and their optimal values are equal.
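The convexity claim for the two constraint functions can be checked numerically with a short Python sketch. It is only a sanity check under assumptions: `g1` is taken as the computation amount divided by the execution time, and `g2` as the power obtained by inverting the Shannon rate when the uploading time equals a fixed time budget minus the execution time. These forms, the parameter names, and all values are our own reading and are not copied from the paper's (32) and (33).

```python
import math


def g1(t_exe, cycles):
    """Required computation capability for a given execution time (assumed form)."""
    return cycles / t_exe


def g2(t_exe, bits, bandwidth_hz, gain, noise_plus_interference, budget):
    """Required transmit power when uploading must finish in budget - t_exe (assumed form)."""
    t_upl = budget - t_exe
    if t_upl <= 0:
        raise ValueError("no time left for uploading")
    spectral_eff = bits / (bandwidth_hz * t_upl)
    return (2.0 ** spectral_eff - 1.0) * noise_plus_interference / gain


def second_difference(func, x, step=1e-3):
    """Central second difference; positive values indicate local convexity."""
    return func(x - step) - 2.0 * func(x) + func(x + step)


# Illustrative parameters (assumed): 2 s left after fetching, 2 Mbit input, 1 MHz.
PARAMS = dict(bits=2e6, bandwidth_hz=1e6, gain=1e-6,
              noise_plus_interference=1e-8, budget=2.0)
grid = [0.2 + 0.1 * i for i in range(14)]  # execution times from 0.2 s to 1.5 s
convex_g1 = all(second_difference(lambda t: g1(t, 1e9), t) > 0 for t in grid)
convex_g2 = all(second_difference(lambda t: g2(t, **PARAMS), t) > 0 for t in grid)

# Round trip: the power returned by g2 reproduces the intended uploading time.
power = g2(1.0, **PARAMS)
rate = PARAMS["bandwidth_hz"] * math.log2(
    1.0 + power * PARAMS["gain"] / PARAMS["noise_plus_interference"])
upload = PARAMS["bits"] / rate
print(convex_g1, convex_g2, power, upload)
```

A positive second difference on a grid does not prove convexity, but it is a cheap way to catch a sign or exponent error before relying on the analytical derivatives.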
Next, we define the Lagrangian relaxation function $L$ of problem (35), where $\lambda_{m,k_n(t),1}$ and $\lambda_{m,k_n(t),2}$ are the Lagrangian multipliers. The Lagrangian relaxation function relaxes the constraints of problem (35). Here, we formally transform problem (35) into its dual problem (45), where we first fix $\lambda_{m,k_n(t),1}$ and $\lambda_{m,k_n(t),2}$ and minimize $L$ over $T^{exe}_{m,k_n(t)}(t)$ and $x_{k_n(t),i}(t)$ to obtain the infimum, and then maximize the infimum over the multipliers. Constraint (46) indicates that the Lagrangian multipliers are positive.
We further detail the KKT conditions of problem (45), which include the complementary slackness condition $\lambda_{m,k_n(t),2}\,(g_2(T^{exe}_{m,k_n(t)}(t)) - p^{\max}) = 0$. After iterations of the Newton method, we obtain the optimal solution $T^{exe\,*}_{m,k_n(t)}$. Then, the optimal resource allocations $f_m(t)^*$ and $p_{n,m}(t)^*$ can be calculated according to (32) and (33), respectively.
Since problem (45) is convex with respect to the optimization variable, the update iteration converges to the optimal solution provided that the required conditions on $\alpha$ hold, including $\sum_{\tau_{sub}=1}^{\infty} \alpha_{m,k_n(t),2}(\tau_{sub})^2 < \infty$, where $\tau_{sub}$ is the iteration index. A proof is given in [26].
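To illustrate how a multiplier update of this kind converges, the following minimal sketch runs projected subgradient ascent on the dual of a toy problem (minimize $x^2$ subject to $1 - x \le 0$), not on problem (45). The toy problem, the function name, and the step size rule $\alpha(\tau) = 1/\tau$ are assumptions chosen for illustration; this rule is square-summable but not summable, which is the usual diminishing step size setting.

```python
def dual_subgradient_ascent(iterations: int = 10000):
    """Projected subgradient ascent on the dual of: min x^2  s.t.  1 - x <= 0.

    Toy stand-in for a multiplier update (hypothetical example, not problem (45)).
    The known optimum is x* = 1 with multiplier lambda* = 2.
    """
    lam = 0.0
    for tau in range(1, iterations + 1):
        # Inner minimization of L(x, lam) = x^2 + lam * (1 - x) gives x = lam / 2.
        x = lam / 2.0
        # The constraint value at x is a subgradient of the dual function.
        subgrad = 1.0 - x
        # Diminishing step size: sum of squares is finite, sum itself diverges.
        step = 1.0 / tau
        # Project onto lam >= 0 to keep the multiplier dual-feasible.
        lam = max(0.0, lam + step * subgrad)
    return lam, lam / 2.0


if __name__ == "__main__":
    lam, x = dual_subgradient_ascent()
    print(f"lambda ~ {lam:.4f}, x ~ {x:.4f}")
```

The iterates approach the optimum slowly, which is typical of diminishing step sizes; a Newton step on the primal variable, as used in the text, converges much faster where second derivatives are available.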
Based on the given service caching placement decision and service parameter freshness status, we can calculate the cost of local execution $C^{loc}_{k_n(t)}$, the offloading cost $C^{off}_{m,k_n(t)}$ of each MEC server $m$, and the minimum offloading cost $C^{off}_{m^*,k_n(t)}$ among all the MEC servers, where $T^*_{k_n(t)}$ and $E^*_{k_n(t)}$ are calculated according to $f_m(t)^*$ and $p_{n,m}(t)^*$. The offloading decision can then be derived. If $C^{off}_{m^*,k_n(t)} < C^{loc}_{k_n(t)}$, then $x^{exe}_{m^*,k_n(t)}(t) = 1$ and $x^{exe}_{m,k_n(t)}(t) = 0, \forall m \in \mathcal{M} \setminus \{m^*\}$, and the inference task is offloaded to MEC server $m^*$; otherwise, $x^{exe}_{m,k_n(t)}(t) = 0, \forall m \in \mathcal{M}$, and the inference task is executed locally. The pseudo-code of LMKO is shown in Algorithm 1. The complexity of LMKO is $O(|\mathcal{M}|(\tau_{new} + \tau_{sub}) + |\mathcal{M}|)$, where $\tau_{new}$ and $\tau_{sub}$ are the iteration numbers of the Newton method and the subgradient method, respectively.
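The final decision step can be sketched as follows, assuming the local cost and the per-server offloading costs have already been computed. The function name and the dictionary layout are hypothetical; the rule itself is simply to offload to the cheapest MEC server only when that is cheaper than local execution.

```python
from typing import Dict, Optional, Tuple


def offloading_decision(
    local_cost: float, offload_costs: Dict[str, float]
) -> Tuple[Optional[str], float]:
    """Return (chosen MEC server, cost); the server is None for local execution.

    `offload_costs` maps a (hypothetical) MEC server id to its offloading cost.
    """
    if not offload_costs:
        # No MEC server is available, so the task must run locally.
        return None, local_cost
    # m*: the MEC server with the minimum offloading cost.
    best_server = min(offload_costs, key=offload_costs.get)
    best_cost = offload_costs[best_server]
    # Offload only if the best server beats local execution.
    if best_cost < local_cost:
        return best_server, best_cost
    return None, local_cost


if __name__ == "__main__":
    print(offloading_decision(5.0, {"m1": 4.2, "m2": 3.1}))
    print(offloading_decision(2.0, {"m1": 4.2, "m2": 3.1}))
```

The scan over servers is linear in $|\mathcal{M}|$, which corresponds to the additive $|\mathcal{M}|$ term in the stated complexity.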

Lyapunov Optimization-Based Learning and Update Control Module (LLUC)
Since the application server also hosts other applications, its bandwidth resources are limited and need to be economized. In this subsection, we minimize the bandwidth consumption of the application server from a global perspective while minimizing the service fetching time to accelerate inference task processing.
The inference tasks are generated randomly, and an execution request, whether offloaded or local, is a random event for the MEC server and the mobile device. If there is no cached service or the service parameters are not fresh, they request the service from the application server. Therefore, from the viewpoint of the application server, the fetching request is also random and has no a priori distribution. We regard the total requested service data size waiting for transmission as a queue, and leverage Lyapunov optimization to solve the problem of stabilizing a queue system with random arrivals.
The application server transmits the requested service data as soon as possible to decrease the fetching time. At timeslot $t$, the total requested service data size is defined as the enqueued rate $d_{enq}(t)$. Moreover, let $d_{deq}(t)$ be the dequeued rate, which is the total size of the service data transmitted from the application server to the requesting MEC servers or mobile devices. The backlog of the queue is then defined as $Q(t+1) = \max\{Q(t) + d_{enq}(t) - d_{deq}(t), 0\}$, so the enqueued rate and dequeued rate affect the queue backlog of the next timeslot. We define the quadratic Lyapunov function as $Y(t) = \frac{Q(t)^2}{2}$, and the Lyapunov drift is denoted as $\Delta Y(t) = Y(t+1) - Y(t)$. In addition, we define the penalty function of the Lyapunov optimization, which equals the total allocated bandwidth consumed for transmitting the requested service data at timeslot $t$: $b_0(t) = \beta d_{deq}(t)$, where $\beta$ is the simplified transformation coefficient.
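The queue dynamics and the drift-plus-penalty quantity described above can be written out directly. The sketch below uses hypothetical function names and treats the penalty weight as a plain number passed in by the caller; it only evaluates the expressions and does not perform the optimization.

```python
def queue_step(backlog: float, enqueued: float, dequeued: float) -> float:
    """Backlog update Q(t+1) = max{Q(t) + d_enq(t) - d_deq(t), 0}."""
    return max(backlog + enqueued - dequeued, 0.0)


def lyapunov(backlog: float) -> float:
    """Quadratic Lyapunov function Y(t) = Q(t)^2 / 2."""
    return backlog ** 2 / 2.0


def drift_plus_penalty(
    backlog: float, enqueued: float, dequeued: float, beta: float, weight: float
) -> float:
    """Drift Y(t+1) - Y(t) plus the weighted penalty V(t) * b_0(t), b_0(t) = beta * d_deq(t)."""
    next_backlog = queue_step(backlog, enqueued, dequeued)
    drift = lyapunov(next_backlog) - lyapunov(backlog)
    penalty = beta * dequeued
    return drift + weight * penalty


if __name__ == "__main__":
    # Draining more than arrives lowers the backlog but costs bandwidth.
    print(queue_step(10.0, 4.0, 6.0))
    print(drift_plus_penalty(10.0, 4.0, 6.0, beta=0.5, weight=2.0))
```

A larger weight makes each unit of dequeued data more expensive relative to the drift term, which is the trade-off the penalty weight controls.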
We formally transform problem (24) into a Lyapunov optimization problem, where $V(t)$ is the adaptive weight of the penalty, (65) is the stability condition of the queue system, and $b^{\max}_0$ in (66) is the maximum total available bandwidth between the application server and all requesting MEC servers or mobile devices.

19: end if
20: end for
21: end for
22: return $x^{exe}_{m,k_n(t)}(t)$

Theorem 1 ([27]). Assume there are constants $D \ge 0$, que > 0, $V^{\max} \ge 0$, and $b^{\max}_0 > 0$ such that, for all $t$ and all possible values of $Q(t)$, the Lyapunov drift-plus-penalty condition holds, where $b^{\min}_0$ is the minimum of $b_0(t)$.
Theorem 1 states that when the Lyapunov drift-plus-penalty condition is met, the average queue backlog is at most $O(V^{\max})$, and the average bandwidth is at most $O(1/V^{\max})$ above the maximum bandwidth. Hence, there is a trade-off between the queue backlog and the bandwidth penalty, which is tuned by $V(t)$.
In addition, we can derive the optimal controlled dequeued rate by taking the derivative with respect to $d_{deq}(t)$ and setting it to 0. To improve the Lyapunov optimization, we first design an adaptive learning penalty weight method that adjusts $V(t)$ according to (71), where $\zeta$ is the learning rate of the penalty weight, $\varphi(t)$ represents the ratio of inference tasks that miss their tolerance time, and $\mathbb{1}$ is an indicator function. The emphasis on the bandwidth penalty is lowered as the ratio of overtime inference tasks increases. When the ratio decreases, the weight of the bandwidth is set higher. Secondly, the transmission data sizes differ among the requesting MEC servers or mobile devices, e.g., some request both the service libraries and parameters while others request only the parameters. When the weight of the bandwidth penalty is high, the application server can therefore respond preferentially to requests for only the service parameters in order to decrease the bandwidth consumption. Accordingly, we devise a dequeued rate update mechanism. When V(t) > thr, where thr is a given threshold of the penalty weight, the dequeued rate is updated so that the requested transmission data of cases 3 and 6 are assigned to timeslot $t + 1$, which alleviates the bandwidth penalty at the current timeslot $t$.
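The dequeued rate update mechanism amounts to a simple split of the pending requests. The sketch below is an illustration under assumed data structures: each request is a hypothetical tuple of requester id, case index, and data size, and the function name is not from the paper. Cases 3 and 6 are the requests that carry both service libraries and parameters.

```python
from typing import List, Tuple

# (requester id, case index, requested data size); a hypothetical layout.
Request = Tuple[str, int, float]

# Cases that request service libraries plus parameters (the larger transfers).
DEFERRABLE_CASES = (3, 6)


def split_requests(
    requests: List[Request], weight: float, threshold: float
) -> Tuple[List[Request], List[Request]]:
    """Return (served at timeslot t, deferred to timeslot t + 1)."""
    if weight <= threshold:
        # The bandwidth penalty is not dominant: serve everything now.
        return list(requests), []
    serve_now = [r for r in requests if r[1] not in DEFERRABLE_CASES]
    deferred = [r for r in requests if r[1] in DEFERRABLE_CASES]
    return serve_now, deferred


if __name__ == "__main__":
    pending = [("m1", 2, 1.0), ("m2", 3, 5.0), ("n1", 5, 1.0), ("n2", 6, 5.0)]
    now, later = split_requests(pending, weight=1.5, threshold=1.0)
    print("dequeued size now:", sum(size for _, _, size in now))
    print("deferred:", [rid for rid, _, _ in later])
```

Deferring only the large transfers keeps parameter-only requests responsive while trimming the bandwidth used in the current timeslot.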
Thirdly, consider service parameters that have only a few timeslots left until their next training. If the application server sends this data directly, it may be requested again soon because the service parameters become stale. We therefore propose a freshness-aware transmitting method to reduce the service request frequency: service parameters that will be trained soon are scheduled for transmission at the end of their training. If the time condition $((t - t_j) \bmod T^{int}_j) > \eta T^{int}_j$ is satisfied, where $\eta$ is the given proportion of the training round and mod is the remainder operator, the dequeued rate is rearranged so that the service about to be trained is moved from timeslot $t$ to timeslot $t + T^{int}_j - ((t - t_j) \bmod T^{int}_j)$. The pseudo-code of LLUC is shown in Algorithm 2. The complexities of the adaptive learning penalty weight method, the dequeued rate update mechanism, and the freshness-aware transmitting method are $O(|\mathcal{N}|)$, $O(|\mathcal{M}||\mathcal{N}||\mathcal{J}|)$, and $O(|\mathcal{M}||\mathcal{N}||\mathcal{J}|)$, respectively.
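The freshness-aware rule is a small piece of modular arithmetic, sketched below. The function and argument names are hypothetical; `t_j` stands for the reference timeslot of service $j$ and `interval` for its training interval $T^{int}_j$, both taken as integers here for simplicity.

```python
def transmit_slot(t: int, t_j: int, interval: int, eta: float) -> int:
    """Return the timeslot at which the parameters of service j are transmitted.

    If more than a fraction `eta` of the current training round has elapsed,
    transmission is postponed to the end of the round; otherwise it happens at t.
    """
    elapsed = (t - t_j) % interval
    if elapsed > eta * interval:
        # Wait for the remaining slots of this round so fresh parameters are sent.
        return t + interval - elapsed
    return t


if __name__ == "__main__":
    # With a 10-slot round and eta = 0.8, a request at slot 19 waits one slot.
    print(transmit_slot(t=19, t_j=0, interval=10, eta=0.8))
    # A request early in the round is served immediately.
    print(transmit_slot(t=13, t_j=0, interval=10, eta=0.8))
```

A larger `eta` postpones fewer requests, which lowers the added latency but lets more soon-to-be-stale parameters through.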

Algorithm 2 Lyapunov optimization-based learning and update control module (LLUC).
Require: enqueued rate $d_{enq}(t)$, initial queue backlog $Q(1)$, transformation coefficient $\beta$, initial bandwidth penalty weight $V(1)$, learning rate of penalty weight $\zeta$, given threshold of penalty weight thr, given training round proportion $\eta$
Ensure: dequeued rate $d_{deq}(t)$
1: for t = 1 to T do
2: Update $V(t)$ based on the adaptive learning penalty weight method according to (71).

KM Algorithm-Based Channel Division Fetching Module (KCDF)
Since the application server communicates with the MEC servers or mobile devices over the cellular network, cellular channel matching is crucial for reducing the total transmission time of the requested service data $d_{deq}(t)$.
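The total transmission time of the dequeued requested service data is the sum, over the responding MEC servers $m \in \mathcal{M}(t)$, of $x_{k_n(t),2}(t)\,T^{par}_{0,m,j} + x_{k_n(t),3}(t)\,(T^{lib}_{0,m,j} + T^{par}_{0,m,j})$, plus the sum, over the responding mobile devices $n \in \mathcal{N}(t)$, of $x_{k_n(t),5}(t)\,T^{par}_{0,n,j} + x_{k_n(t),6}(t)\,(T^{lib}_{0,n,j} + T^{par}_{0,n,j})$, where $\mathcal{M}(t)$ and $\mathcal{N}(t)$ are the responding sets of MEC servers and mobile devices based on the dequeued service data, respectively.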
First, we divide the total allocated bandwidth into two parts: one is allocated for transmitting cases 2 and 5, and the other is allocated for cases 3 and 6, which involve more transmitted data. The bandwidth division method yields $b^{par}_0(t)$ and $b^{lib,par}_0(t)$, the total allocated bandwidths of cases 2 and 5 and of cases 3 and 6, respectively, based on their total transmitted data sizes.
Taking cases 2 and 5 as an example, we define the response set as $S = \mathcal{M}^{par}(t) \cup \mathcal{N}^{par}(t)$, where $S$ is indexed by $s$ and has cardinality $|S|$, and $\mathcal{M}^{par}(t)$ and $\mathcal{N}^{par}(t)$ are the responding sets of MEC servers and mobile devices with cases 2 and 5, respectively. Let $A = \{1, \cdots, |A|\}$ be the set of $|A|$ cellular channels indexed by $a$. The matching decision of $s$ and $a$ is defined as $x^{mat}_{s,a}(t) = 1$ if $a$ is allocated to $s$; otherwise, $x^{mat}_{s,a}(t) = 0$. It is constrained by $\sum_{s \in S} x^{mat}_{s,a}(t) = 1, \forall a \in A$, and $\sum_{a \in A} x^{mat}_{s,a}(t) = 1, \forall s \in S$, where each cellular channel is allocated to at most one MEC server or mobile device, and each MEC server or mobile device is assigned at most one cellular channel. The transmission latency of the service parameters from the application server to the MEC server or mobile device $s$ over cellular channel $a$ then follows, where $j(s)$ is the service type transmitted for $s$, and $p_{s,a}(t)$, $h_{s,a}(t)$, and $I_{s,a}(t)$ are the transmission power, channel gain, and co-channel interference on channel $a$ to $s$, respectively.
Here, we formally formulate the problem of minimizing the total transmission time of the service parameters over the matching decision $x^{mat}_{s,a}(t)$, subject to constraints (81)-(83) on the matching decision, which include $\sum_{a \in A} x^{mat}_{s,a}(t) = 1, \forall s \in S$, and $x^{mat}_{s,a}(t) \in \{0, 1\}, \forall s \in S, a \in A$, where $x^{mat}_{s,a}(t)$ is the optimization variable.
We leverage the KM algorithm [28] to solve the problem. The complete weighted bipartite graph is defined as $G = (S, A, \langle S, A \rangle)$, where $S$ and $A$ are the vertex sets of the two sides, $\langle S, A \rangle$ is the link set, and the weight $w_{s,a}$ of each link is derived from our devised link-initialized method, where $\theta$ is a coefficient of the minimum service data transmission time that is used to remove links with unacceptable transmission time. A feasible vertex labeling satisfies $w_s + w_a \le w_{s,a}$, where $w_s = \min_{a \in A} w_{s,a}$ and $w_a = 0$ are the initial vertex labels of $s$ and $a$ in the KM algorithm. Let $G^{mat} = (S, A, \langle S^{mat}, A^{mat} \rangle)$ be the equalling matching subgraph, whose links satisfy $w_s + w_a = w_{s,a}$ and whose link set $\langle S^{mat}, A^{mat} \rangle$ is initialized to an empty set. The perfect matching of the equalling matching subgraph $G^{mat}$ is denoted as $M^*$, and we have the following theorem.
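For intuition, the following self-contained sketch solves a minimum-cost assignment with the classical vertex-label (potential) form of the Kuhn-Munkres method. It is a generic textbook implementation, not the link-initialized variant described above: the function name, the 0-based indexing, and the example cost matrix are assumptions, all weights are taken to be finite, and the number of requesters must not exceed the number of channels.

```python
from typing import List, Tuple


def km_min_assignment(cost: List[List[float]]) -> Tuple[List[int], float]:
    """Minimum-cost assignment of rows (requesters s) to columns (channels a).

    Returns (assignment, total) where assignment[s] is the channel given to s.
    Requires len(cost) <= len(cost[0]) and finite weights.
    """
    n, m = len(cost), len(cost[0])
    inf = float("inf")
    u = [0.0] * (n + 1)   # labels of rows (1-based, index 0 unused)
    v = [0.0] * (m + 1)   # labels of columns
    p = [0] * (m + 1)     # p[j]: row currently matched to column j (0 = free)
    way = [0] * (m + 1)   # predecessor column on the alternating path
    for i in range(1, n + 1):
        p[0] = i
        j0 = 0
        minv = [inf] * (m + 1)
        used = [False] * (m + 1)
        while True:
            used[j0] = True
            i0, delta, j1 = p[j0], inf, 0
            for j in range(1, m + 1):
                if not used[j]:
                    # Reduced cost of link (i0, j) under the current labels.
                    cur = cost[i0 - 1][j - 1] - u[i0] - v[j]
                    if cur < minv[j]:
                        minv[j], way[j] = cur, j0
                    if minv[j] < delta:
                        delta, j1 = minv[j], j
            # Update labels so that at least one new tight link appears.
            for j in range(m + 1):
                if used[j]:
                    u[p[j]] += delta
                    v[j] -= delta
                else:
                    minv[j] -= delta
            j0 = j1
            if p[j0] == 0:
                break
        # Flip the alternating path to enlarge the matching by one row.
        while j0:
            j1 = way[j0]
            p[j0] = p[j1]
            j0 = j1
    assignment = [-1] * n
    for j in range(1, m + 1):
        if p[j]:
            assignment[p[j] - 1] = j - 1
    total = sum(cost[s][assignment[s]] for s in range(n))
    return assignment, total


if __name__ == "__main__":
    # Hypothetical transmission times of 3 requesters over 3 channels.
    times = [[4.0, 1.0, 3.0], [2.0, 0.0, 5.0], [3.0, 2.0, 2.0]]
    print(km_min_assignment(times))
```

The labels `u` and `v` play the role of the vertex labels, and a link whose reduced cost is zero is a tight link, so each outer iteration grows a matching that uses only tight links.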

System Implementation
We implement the framework in a real-world collaborative edge system testbed that consists of a Raspberry Pi 4 Model B board (with a 1.5 GHz CPU and 4 GB memory) and a desktop (with an 8-core Intel i7-10700F 2.90 GHz CPU and 16 GB memory). The Raspberry Pi serves as the application server, and the desktop serves as the MEC servers and mobile devices. All devices are connected to a local wireless router. We use transmission control protocol (TCP) socket programming to guarantee reliable communication among all devices in the environment.

Case Study
We present a simulation of the proposed framework on the edge system testbed through a real-world image analysis case study: automatic license plate recognition. In particular, we leverage the convolutional neural network (CNN) framework as a service developed in [30]: an ImageNet model VGG-16. The VGG-16 model is a deep CNN with 16 layers for image recognition tasks and is trained in a distributed machine learning style. We use the open-source automatic license plate recognition dataset (available online: https://platerecognizer.com (accessed on 6 May 2022)) to emulate the tasks generated by mobile devices.

Experiment Setup
We use simulations to compare the performance of the framework. The hyperparameters of the simulation are as follows: the input size of the task is in [2,10]. We also select a few representative strategies for comparison with the KCDF module.
• Hungarian algorithm (HA) [29]: An algorithm for solving the maximal matching problem of an unweighted bipartite graph.
• Channel bandwidth allocated-based size (CBAS): An algorithm where the total bandwidth is allocated based on the responding service data size.
• Channel bandwidth allocated-based case (CBAC): An algorithm where the total bandwidth is allocated based on the requesting offloading case.
• Uniform allocation of channel bandwidth (UACB): An algorithm where the total bandwidth is allocated uniformly.

LLUC Evaluation
We first investigate the LLUC module by comparing the time-averaged total bandwidth under different learning rates of the penalty weight. Figure 2a shows that our proposed LLUC module with ζ = 0.1 achieves the best result across all requesting numbers. As ζ increases from 0.1 to 0.4, the performance degrades for all requesting numbers. A lower learning rate results in a relatively high penalty weight, so the Lyapunov optimization puts more emphasis on minimizing the penalty. At the same time, selecting a lower learning rate may lead to a higher backlog and further delay the response fetching time. Therefore, it is advisable to make a moderate selection to balance the bandwidth consumption and time overhead. Since using ζ = 0.2 increases the bandwidth by only 24.0% while decreasing the queue backlog by 53.3% compared to using ζ = 0.1 for 10 requests, we take ζ = 0.2 as the learning rate, considering both bandwidth and time.
Then, we study the LLUC module by comparing the average AoI of the response service data under distinct round proportions of request rearrangement. In Figure 2b, the simulation results show that when η = 0.8, i.e., when a 20% interval is left until the next training, the algorithm consistently achieves the lowest AoI regardless of the number of requests. As η increases from 0.8 to 0.95, the average AoI becomes higher. With a lower η, more requests are rearranged to another timeslot so that the service data has a lower AoI. However, the time latency can deteriorate as the response timeslots are delayed. Therefore, considering both the response transmission time and the AoI of the service data, which determines the service fetching frequency, we select a medium proportion η = 0.9 to balance the trade-off: compared with η = 0.8 under 10 requests, its average AoI increases by only 23.8% while the time latency decreases by 36.2%.

KCDF Evaluation
For the KCDF module, we first evaluate the average fetching time under different link-initialized coefficients. Figure 3a shows that KCDF with θ = 1.8 outperforms the other link-initialized coefficients as the responding number increases. A lower link-initialized coefficient removes more unsatisfactory links in the KM algorithm, and as θ increases from 1.8 to 2.4, the average fetching time becomes higher. However, selecting a link-initialized coefficient that is too low, such as θ = 1.5, increases the probability that an MEC server or mobile device fails to find a link in the equalling matching subgraph, which can significantly degrade the performance. To balance this trade-off, we select θ = 1.8, which guarantees the transmission time latency with at least a 2.3% performance improvement over the second-best result from θ = 1.5 for 10 responses. Secondly, we compare the average fetching time under distinct rearrangement conditions in Figure 3b. Rearranging when the link weight is not less than the maximum, i.e., when $w_{s,a}$ for $\langle s, a \rangle \in M^*$ equals $\max_{a \in A} w_{s,a}$, yields the minimum average fetching time across responding numbers. The fetching time increases when the condition moves from the maximum to the second maximum and then to the third maximum. When the application server rearranges requests whose link weight is not less than the third maximum, the result is even worse than without rearrangement. If the link is not the worst choice for the MEC server or mobile device, it is better to respond at the current timeslot; otherwise, it suffers a higher fetching time. We therefore adopt the rearrangement condition that the link weight is not less than the maximum and pass it to the LLUC module, which results in a time reduction of at least 3.7% compared to the second-maximum case under 10 responses.

Average Cost Comparison
Figure 4a shows the average cost of the LMKO module and its baselines. Our LMKO module achieves the minimum of $Z_1$ under all requesting numbers, since it obtains the minimum cost by choosing between offloading the task to the best MEC server and local execution. The second-best result belongs to the FCOP algorithm, since the mobile device chooses an MEC server with a fresh cache to offload to. COP and OP perform poorly owing to their extra service fetching times. The worst result comes from LEFC, since it does not take advantage of offloading in the edge system. The LMKO module is efficient in terms of the cost of time delay and energy consumption, with at least a 4.1% performance gain over the second-best result from FCOP under 10 requests.
Figure 4b illustrates the time-averaged total allocated bandwidth of the LLUC module and its baselines. The proposed LLUC module outperforms the other baselines as the requesting number increases, because it maintains a controlled queue backlog while minimizing the total allocated bandwidth. Meanwhile, the TBP algorithm obtains the second-lowest result since it takes a higher penalty weight, although this causes a longer queue backlog. The QBP algorithm preferentially considers the queue backlog, leading to a medium result. The FTB algorithm yields a high result despite its fixed total bandwidth, due to its large backlog. The worst result belongs to the QBE algorithm, which keeps the queue backlog at 0 even as the bandwidth significantly increases. The LLUC module outperforms the other baselines in regard to queue backlog stability and total allocated bandwidth, with at least a 19.7% performance gain over the second-best result from TBP under 10 requests.
Figure 4c compares the average fetching times of the KCDF module and its baselines. Our KCDF module exhibits superior performance across varying response numbers. It finds a perfect matching in the equalling matching subgraph, where each MEC server or mobile device is matched to its best-allocated cellular channel. The second-lowest fetching time belongs to the CBAS algorithm, which allocates the bandwidth according to the service data size so that each MEC server or mobile device can obtain a satisfactory channel. The CBAC algorithm, which allocates the bandwidth by offloading case, behaves similarly to CBAS and attains a medium result. UACB performs poorly since it allocates the bandwidth uniformly, which leaves responses that carry service libraries and parameters with unsatisfactory transmission latency. The HA algorithm suffers the worst result since it never updates the vertex labels when it cannot find a link in the equalling matching subgraph, so only a few vertices are matched, i.e., only a few channels are allocated. Thus, our KCDF module is efficient in terms of average fetching time, with at least a 2.2% performance gain over the second-best result from CBAS under 10 responses.

Average Time Cost of Baselines Combination
In Figure 5a, we investigate the average time cost of the baseline combinations, which include our proposed ASCO framework (the LMKO, LLUC, and KCDF modules) and combinations with the most competitive baselines, namely the FCOP, TBP, and CBAS algorithms. Our ASCO framework always outperforms the other baseline combinations as the time weight parameter changes while the weights of energy and bandwidth remain fixed. We minimize the average time cost by finding the most suitable offloading decision and allocating the best cellular channel. The other baseline combinations perform worse than our framework. LMKO+TBP+KCDF, LMKO+LLUC+CBAS, and LMKO+TBP+CBAS achieve moderate performance, as the substituted modules bring no significant improvement. On the other hand, FCOP+LLUC+KCDF, FCOP+LLUC+CBAS, and FCOP+TBP+KCDF incur higher costs, as their modules place less emphasis on time concerns. FCOP+TBP+CBAS has the worst performance due to the absence of all the proposed modules. As ξ tim increases, the performance gap between our framework and the other baseline combinations widens, which shows that the proposed framework has a significant gain in terms of time delay, with at least a 9.6% improvement over the second-best result from LMKO+TBP+KCDF under ξ tim = 0.1.
Figure 5b shows the average energy cost of the combinations. Our ASCO framework keeps the best result as the energy weight varies while the weights of time and bandwidth are fixed. The LMKO module makes energy-efficient offloading decisions to save the energy consumption of mobile devices. At the same time, the baseline combinations with the LMKO module outperform the combinations without it, owing to the consideration of energy in the LMKO module. LMKO+TBP+KCDF, LMKO+LLUC+CBAS, and LMKO+TBP+CBAS obtain middle performances because their other modules lack energy concerns. FCOP+LLUC+KCDF, FCOP+LLUC+CBAS, and FCOP+TBP+KCDF have higher costs since their modules are not cost-efficient. Similarly, FCOP+TBP+CBAS has the worst performance. Hence, the average energy result verifies that our proposed framework achieves superior performance with respect to energy consumption, with at least a 2.8% improvement over the second-best result from LMKO+TBP+KCDF under ξ ene = 0.05.

Average Bandwidth Consumption of Baselines Combination
Figure 5c compares the average bandwidth consumption allocated from the application server under the baseline combinations. The proposed ASCO framework attains the minimum results except in an extreme case with ξ ban = 40, while the weights of time and energy are fixed. In this case, the emphasis on bandwidth allocation is so strong that our framework obtains only the second-best performance, while LMKO+TBP+KCDF has the best result. However, such a bandwidth weight is impractical since it leads to relatively little consideration of the time delay and energy consumption. LMKO+LLUC+CBAS and LMKO+TBP+CBAS obtain middle performances because they do not balance the total bandwidth and mobile device cost well. FCOP+LLUC+KCDF, FCOP+LLUC+CBAS, FCOP+TBP+KCDF, and FCOP+TBP+CBAS always obtain the worst results due to the inefficiency of their algorithms. In most moderate-weight cases, our framework dominates the other baseline combinations in regard to the average bandwidth consumption, with at least a 6.2% improvement over the second-best result from LMKO+TBP+KCDF under ξ ban = 10.

Conclusions
In this work, we consider a scenario of AoI-aware service caching-assisted offloading. The proposed ASCO framework consists of three modules: (1) the LMKO module, based on the method of Lagrange multipliers with KKT conditions; (2) the LLUC module, based on Lyapunov optimization; and (3) the KCDF module, based on the KM algorithm. The simulation results verify that the proposed ASCO framework outperforms the other baseline combinations with respect to time overhead, energy consumption, and allocated bandwidth. The ASCO framework is efficient for both the individual inference task and the global bandwidth allocation, and is viable for practical deployment.
This work can be extended in several directions. First, with proactive service caching, the MEC server can predict offloading requests and ask the application server to fetch services in advance. Second, with task partitioning, tasks that are partitioned before execution can have their subtasks executed on distinct MEC servers or locally.

Conflicts of Interest:
The authors declare no conflict of interest.