Modeling the Decision and Coordination Mechanism of Power Battery Closed-Loop Supply Chain Using Markov Decision Processes

Abstract: With the rapid growth of the new energy vehicle market, efficient management of the closed-loop supply chain of power batteries has become an important issue. Effective closed-loop supply chain management is critical because it bears on the efficient utilization of resources, environmental responsibility, and the realization of economic benefits. In this paper, the Markov Decision Process (MDP) is used to model the decision-making and coordination mechanism of the closed-loop supply chain of power batteries in order to cope with the challenges in the management process, such as cost, quality, and technological progress. By constructing an MDP model for each supply chain participant, this paper investigates the optimization strategy of the supply chain and applies two solution methods: dynamic programming and reinforcement learning. The case study results show that the model can effectively identify optimized supply chain decisions, improve the overall efficiency of the supply chain, and coordinate the interests among parties. The contribution of this study is to provide a new modeling framework for power battery recycling and to demonstrate the practicality and effectiveness of the method with empirical data. This study demonstrates that the Markov decision process can be a powerful tool for closed-loop supply chain management, promotes a deeper understanding of the complex decision-making environment of the supply chain, and provides a new solution path for decision-making and coordination in the supply chain.


Introduction
Against the background of a global energy transition and rising awareness of environmental protection, the new energy vehicle industry has ushered in unprecedented development opportunities. As a core component of new energy vehicles, the power battery, through its performance and service life, directly affects the promotion and sustainable development of new energy vehicles. However, with the popularization of new energy vehicles, the treatment and recycling of used power batteries have gradually become a focus of social concern, and effective solutions are urgently needed. Taking a new energy vehicle manufacturer in China as a case study, this study examines the decision-making and coordination mechanism of the closed-loop supply chain of power batteries by collecting and analyzing relevant data, aiming to provide new ideas for the scientific management of this environmental and resource issue.
This paper is based on the Markov Decision Process (MDP), a theoretical tool that has been widely used in fields such as supply chain management, inventory control, and transportation scheduling. By describing the decision-making process in an uncertain environment, the MDP model provides a new perspective on supply chain management problems such as the high recycling cost of power batteries, poor recycling channels, and unstable quality. In this study, the MDP is applied to the decision analysis of the closed-loop supply chain of power batteries. The analysis covers the design of the recycling network, the pricing strategy, reuse decisions, and inventory control, and it focuses on maximizing the interests of all parties in the supply chain in a dynamic and uncertain environment.
The purpose of this study is to construct a scientific decision-making model that provides theoretical and practical guidance for the management of the closed-loop supply chain of power batteries. Through a case study, this study verifies the validity and practicability of the constructed model, which provides decision-making support for relevant enterprises and helps to promote the efficient recycling and utilization of used power batteries. The innovation of this study is that multiple decision makers and objectives in the supply chain are considered together, a decentralized decision-making mechanism and a centralized coordination mechanism are designed, and the overall performance of the supply chain under different strategies is evaluated through simulation experiments. These results enrich the theory of supply chain management and provide new ideas and methods for optimizing and managing the closed-loop supply chain of power batteries.

Literature Review
This section reviews the research status of closed-loop supply chain management of power batteries and the application of the Markov decision process in supply chain management, and it points out the shortcomings of existing research that this paper aims to address.

Research Status of Power Battery Closed-Loop Supply Chain Management
Closed-loop supply chain management for power batteries covers the entire life cycle of power batteries, from production, use, recycling, and reuse to final disposal, and involves multiple participants, such as manufacturers, distributors, consumers, recyclers, and reusers. The goal of managing the closed-loop supply chain of power batteries is to simultaneously meet consumer demand, achieve efficient reuse of resources, and minimize environmental impacts.
Existing research addresses the following aspects of closed-loop supply chain management for power batteries. In recycling network design, the goal is to minimize the recycling cost and maximize the recycling efficiency using mathematical programming, heuristic algorithms, and meta-heuristic algorithms, as in the study of Jia Xianmin and Li Shaonan (2022) [1]. In recycling pricing strategy, game theory and mechanism design have been used to determine the prices or subsidies that recyclers offer to consumers in order to increase the recycling rate of used power batteries, as in the research of Cai Xiaoqian and Lin Yiyi (2023) [2]. In reuse decision making, multi-attribute decision-making methods are used to select the best reuse route for used power batteries so that the reuse value is maximized, as in the work of Yang Shuai (2023) [3]. In inventory control, a Markov decision process is used to balance inventory costs and service levels and thereby improve the efficiency of inventory management.
Current research focuses on individual links of the closed-loop supply chain and lacks coordination and optimization from a holistic perspective. Moreover, many studies are based on deterministic or static assumptions and ignore dynamic and uncertain factors such as demand fluctuations, price changes, and technological advances in the supply chain. In view of this, this paper adopts the Markov decision process to model the decision-making and coordination mechanism of the closed-loop supply chain of power batteries, aiming to analyze the impact of these uncertainties and dynamics on the interests of all parties in the supply chain and on overall efficiency, in order to promote theoretical and practical development in the field of closed-loop supply chain management.

Research on the Application of the Markov Decision Process in Supply Chain Management
The Markov Decision Process (MDP) is a mathematical model that describes decision making in uncertain environments. An MDP consists of a set of states, a set of actions, a transfer probability, and a reward function. Solving an MDP yields the optimal strategy, i.e., the action to take in each state so as to maximize the expected reward. MDPs have a wide range of applications in supply chain management, such as inventory control, demand management, and transportation scheduling.
Inventory control is the determination of inventory levels and replenishment strategies to balance inventory costs and service levels. The main approaches to inventory control are the economic order quantity model, the newsvendor model, and the Markov decision process. Both the economic order quantity model and the newsvendor model are based on deterministic or static assumptions and ignore the uncertainty and dynamics of demand and supply. MDP can account for the stochastic and time-varying nature of demand and supply and solve for the optimal replenishment strategy by methods such as dynamic programming or reinforcement learning. For example, Zhou Yang et al. (2023) [4] proposed a dynamic supplier classification management model based on the Markov decision process, which can dynamically adjust the inventory level and replenishment strategy according to changes in suppliers; Huang Shuai-Bo et al. (2022) [5] proposed an energy management strategy model based on the Markov decision process, which can dynamically determine the energy inventory level and replenishment strategy of charging stations according to factors such as the grid price, the charging demand, the storage device, and the renewable energy source; Riccardo A et al. (2023) [6], based on an Italian survey, analyzed the impact of Industry 4.0 technologies on the performance of closed-loop supply chains, including the impact on inventory control: for example, real-time data sharing, intelligent forecasting, and adaptive adjustments can improve the efficiency and accuracy of inventory control.
Demand management is the process of forecasting and influencing consumer demand for products or services in order to improve the efficiency and competitiveness of the supply chain. The main methods of demand management are statistical forecasting, collaborative forecasting, and demand signaling. Statistical forecasting and collaborative forecasting predict future demand from historical data or shared information and ignore the uncertainty and variability of consumer behavior. MDP can account for the stochastic and time-varying nature of consumer behavior and solve for the optimal demand management strategy through methods such as dynamic programming or reinforcement learning. For example, Liu Zhengyuan et al. (2023) [7] established a framework for supply chain reliability analysis based on a propagation dynamics model, which considers the effects of two mechanisms, information propagation and influence propagation, on supply chain demand management: information propagation improves supply chain adaptability and synergy, and influence propagation improves supply chain stability and resilience; Zhang Xuelong et al. (2019) [8] established a Markov-chain-based evolutionary game model of supply chain trust, which can analyze the impact of the trust relationship between the nodes in the supply chain on demand management, showing that trust promotes information sharing, demand coordination, and risk sharing and thereby improves the supply chain's demand satisfaction rate and demand response rate; Feng Z et al. (2023) [9] investigated strategy selection for the participation of an "Internet+Recycling" platform in a two-tier remanufacturing closed-loop supply chain. The strategy involves demand management issues, such as how to determine the appropriate participation method of the "Internet+Recycling" platform according to the demand of different types of consumers for recycled products or services, so as to improve the attractiveness of recycling and the recycling efficiency.
Transportation scheduling refers to determining the allocation and scheduling of transportation resources to minimize transportation costs and maximize transportation efficiency. The main methods of transportation scheduling are mathematical programming, heuristic algorithms, and meta-heuristic algorithms. Mathematical programming and heuristic algorithms are based on deterministic or static assumptions and ignore the uncertainty and dynamics of transportation demand and the transportation environment. MDP can account for the stochastic and time-varying nature of transportation demand and the transportation environment and solve for the optimal transportation scheduling strategy by methods such as dynamic programming or reinforcement learning. For example, Ali P et al. (2023) [10] incorporated the vehicle routing problem into the optimization model of a closed-loop supply chain network, considered factors such as product demand, recycling volume, transportation cost, and environmental impacts, developed a mixed-integer linear programming model to minimize the total cost of the closed-loop supply chain network, and proposed an effective solution algorithm; Mehrnaz B et al. [11] designed a new closed-loop supply chain network with a location-allocation and routing model that considers simultaneous recycling and distribution and optimizes under uncertainty. The model involves problems in transportation scheduling, such as how to determine appropriate location, allocation, and routing schemes according to the recycling and distribution demands in different regions and time periods, so as to optimize the transportation cost, transportation time, transportation distance, and other metrics; Hao G et al. [12] proposed a hybrid differential evolution algorithm for solving the location-inventory problem in a closed-loop supply chain with product recycling. The problem involves transportation scheduling aspects, such as how to determine the appropriate location, inventory level, and transportation resource allocation scheme according to the recycling and distribution demand in different regions and time periods, so as to optimize the same metrics.
The above studies mainly focus on single or partial links in the supply chain and less on the coordination and optimization of the whole supply chain system. In addition, most of these studies are based on a single decision maker or a single objective and ignore the existence of multiple decision makers or multiple objectives in the supply chain, such as profit, cost, service, and environment. Therefore, this paper attempts to model the decision-making and coordination mechanism of a closed-loop supply chain for power batteries from a holistic and multi-objective perspective using a Markov decision process and to analyze its impact on the interests of all parties in the supply chain and on overall efficiency.

Research Progress in Closed-Loop Supply Chain and Reverse Logistics
Investigation into closed-loop supply chains and reverse logistics is increasingly crucial given the growing significance of environmental and resource conservation. Closed-loop supply chains are an advanced form of the traditional supply chain that facilitates the return of products from consumers and their subsequent reintegration into manufacturing cycles. Reverse logistics, in turn, is a critical component of product circulation within these closed-loop systems.
Reverse logistics serves a distinct function within conventional supply chains, concentrating on the efficient retrieval and treatment of returned items and waste, followed by their reintegration as resources into manufacturing and consumer activities. Kolyaei et al. [13] implemented an integrated robust optimization strategy for designing closed-loop supply chain networks. This approach addresses the challenge of optimizing supply chain configurations under uncertainty, thereby minimizing risks and enhancing operational efficiency. The current body of literature indicates that proficient management of closed-loop supply chains can lead to beneficial outcomes in economic, environmental, and societal spheres. In another study, Gu et al. [14] examined the influence of governmental incentives on the recycling strategies for electric vehicle batteries and assessed the economic repercussions of these recycling initiatives by incorporating policy analysis.
Reverse logistics research also covers a wide range of thematic areas, including return processing, remanufacturing, product repair, and material recovery. In this study, reverse logistics, as a component of the closed-loop supply chain, focuses on how to maximize efficiency at the supply chain level, especially in the recycling of power batteries for new energy vehicles. The aim is to reintroduce discarded power batteries into the production process as a resource and thereby achieve the sustainable use of resources. This paper draws on the existing literature and explores new paths for decision making in uncertain and dynamic environments within the framework of the Markov decision process.

Fundamentals of the Markov Decision Process
The Markov Decision Process (MDP) is a mathematical model that describes the process of making decisions in uncertain environments. An MDP consists of a state set, an action set, a transfer probability, and a reward function. The state set contains all possible situations faced by the decision maker, the action set contains all possible actions that the decision maker can take in each state, the transfer probability is the probability of moving to each next state after a given action is taken in a given state, and the reward function is the immediate reward obtained after a given action is taken in a given state.
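These four components can be tied together in one standard relation. The notation below is the usual textbook one and is an assumption made here for illustration, not notation taken from this paper: $S$ is the state set, $A(s)$ the actions available in state $s$, $P(s' \mid s, a)$ the transfer probability, $R(s, a)$ the reward function, and $\gamma \in [0, 1)$ a discount factor (the paper does not state one). Under these assumptions, the optimal value function $V^{*}$ and an optimal policy $\pi^{*}$ satisfy the Bellman optimality equation:

```latex
V^{*}(s) = \max_{a \in A(s)} \Big[ R(s,a) + \gamma \sum_{s' \in S} P(s' \mid s, a)\, V^{*}(s') \Big],
\qquad
\pi^{*}(s) \in \arg\max_{a \in A(s)} \Big[ R(s,a) + \gamma \sum_{s' \in S} P(s' \mid s, a)\, V^{*}(s') \Big].
```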
The solution methods of MDP fall into two main categories: dynamic programming and reinforcement learning. Dynamic programming methods rely on a known model and find the optimal policy through value iteration or policy iteration. The value iteration algorithm repeatedly updates the long-term expected reward of each state until the optimal value function is found and then derives the optimal policy from it; the policy iteration algorithm optimizes the policy directly by alternating policy evaluation and policy improvement steps. Reinforcement learning methods, on the other hand, do not rely on complete information about the model but learn the optimal policy through interaction with the environment. Monte Carlo algorithms and temporal difference algorithms are the two main approaches to reinforcement learning, with the former learning from complete episodes of data and the latter updating its estimates from the immediate feedback at each step.
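As an illustration of the dynamic programming route, the following is a minimal sketch of value iteration for a small finite MDP. The data layout (`P`, `R`), the discount factor `gamma`, and the two-state toy numbers are all assumptions made for this example and do not come from the paper's case study.

```python
from typing import Dict, List, Tuple

State = int
Action = int
# P[s][a] is a list of (next_state, probability); R[s][a] is the immediate reward.
Transitions = Dict[State, Dict[Action, List[Tuple[State, float]]]]
Rewards = Dict[State, Dict[Action, float]]


def q_value(P: Transitions, R: Rewards, V: Dict[State, float],
            s: State, a: Action, gamma: float) -> float:
    """Immediate reward plus discounted expected value of the next state."""
    return R[s][a] + gamma * sum(p * V[s2] for s2, p in P[s][a])


def value_iteration(P: Transitions, R: Rewards, gamma: float = 0.9,
                    tol: float = 1e-8, max_iter: int = 10_000):
    """Return (V, policy) for a finite MDP via value iteration."""
    V = {s: 0.0 for s in P}
    for _ in range(max_iter):
        # Bellman backup for every state, using the previous sweep's values.
        new_V = {s: max(q_value(P, R, V, s, a, gamma) for a in P[s]) for s in P}
        delta = max(abs(new_V[s] - V[s]) for s in P)
        V = new_V
        if delta < tol:
            break
    # Greedy policy with respect to the converged value function.
    policy = {s: max(P[s], key=lambda a: q_value(P, R, V, s, a, gamma)) for s in P}
    return V, policy


# Hypothetical toy problem: state 0 = low stock, state 1 = high stock;
# action 0 = wait, action 1 = replenish.
P_toy = {
    0: {0: [(0, 1.0)], 1: [(1, 1.0)]},
    1: {0: [(0, 0.5), (1, 0.5)], 1: [(1, 1.0)]},
}
R_toy = {
    0: {0: 0.0, 1: -1.0},
    1: {0: 2.0, 1: 1.0},
}
V_toy, policy_toy = value_iteration(P_toy, R_toy)
```

In this toy problem the resulting policy replenishes when stock is low and waits when stock is high. Policy iteration would reach the same policy by alternating evaluation and improvement instead of sweeping values.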
In the context of the closed-loop supply chain of power batteries, the application of the MDP model has significant advantages. Wei Guoxin et al. (2019) [15] proposed a battery life prediction method based on a Markov model, which provides a theoretical basis for the maintenance and replacement of power batteries. Yang Zhe (2018) [16] used a strategy based on a Markov algorithm to smooth the power allocation of hybrid vehicles and improve energy utilization efficiency. These studies suggest that the MDP model can deal effectively with the uncertainty and dynamics in the closed-loop supply chain of power batteries and can serve as a powerful tool for decision making and coordination in the supply chain.

Modeling the Markov Decision Process of Power Battery Closed-Loop Supply Chain
In this paper, we consider a closed-loop supply chain for power batteries consisting of a manufacturer, distributors, consumers, recyclers, and reusers. The manufacturer produces new power batteries and sells them to the distributor; the distributor provides new power batteries and recycling services to the consumer; the consumer uses the power batteries and delivers them to the distributor or the recycler; the recycler recovers used power batteries from the consumer or the distributor and sells them to the reuser; and the reuser reuses the used power batteries and sells them to the manufacturer or the distributor. It is assumed that all parties in the supply chain behave rationally and self-interestedly, i.e., each participant tries to maximize its own profit.
In order to model the Markov decision process of the closed-loop supply chain of power batteries, the state set, action set, transfer probability, and reward function need to be determined. Since there are multiple decision makers in the supply chain, this paper adopts the framework of the Multi-Agent Markov Decision Process (MAMDP), i.e., each decision maker has its own state set, action set, transfer probability, and reward function, but their decisions affect each other. The MDP model for each decision maker is explained separately below.
Manufacturer's MDP model:
State set: The state of the manufacturer consists of two variables, the manufacturer's current inventory of new power batteries and its inventory of reused power batteries. Both inventories are assumed to be discrete and bounded above, so the manufacturer's state set is the set of all pairs of inventory levels in which each component ranges over the integers from 0 to its upper bound.
Action set: The manufacturer's action consists of two variables, the number of new power batteries to be produced in the next period and the number of reused power batteries to be purchased from the reuser. Both quantities are assumed to be discrete and bounded above, so the manufacturer's action set is the set of all pairs of production and purchase quantities in which each component ranges over the integers from 0 to its upper bound.
Transfer probability: The transfer probability of the manufacturer gives the probability of moving from the current inventory state to each possible next-period state under the chosen action.
Reward function: The manufacturer's reward function depends on four parameters: the unit cost of producing a new power battery, the unit price of purchasing a reused power battery, the unit inventory cost of holding a new power battery, and the unit inventory cost of holding a reused power battery. This paper assumes that these parameters are known and constant.
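To make the manufacturer's discrete state and action sets concrete, the following sketch enumerates them for small bounds. All names, bounds, and cost values are hypothetical placeholders chosen for this example. The feasibility rule and the cost-only reward are simplifying assumptions as well: the reward below covers only the four cost parameters named in the text, and any revenue term is left out.

```python
from itertools import product

# Hypothetical upper bounds; the case-study values are not reproduced here.
MAX_NEW_INV = 3      # upper bound on new-battery inventory
MAX_REUSED_INV = 2   # upper bound on reused-battery inventory
MAX_PRODUCE = 2      # upper bound on new batteries produced per period
MAX_PURCHASE = 1     # upper bound on reused batteries purchased per period

# State: (new-battery inventory, reused-battery inventory)
states = list(product(range(MAX_NEW_INV + 1), range(MAX_REUSED_INV + 1)))
# Action: (new batteries to produce, reused batteries to purchase)
actions = list(product(range(MAX_PRODUCE + 1), range(MAX_PURCHASE + 1)))


def feasible_actions(state):
    """Actions that keep both inventories within their upper bounds
    before any sales are deducted (a simplifying assumption)."""
    new_inv, reused_inv = state
    return [
        (produce, purchase)
        for produce, purchase in actions
        if new_inv + produce <= MAX_NEW_INV
        and reused_inv + purchase <= MAX_REUSED_INV
    ]


def cost_part_of_reward(state, action,
                        c_prod=5.0, c_buy=3.0, h_new=0.5, h_reused=0.3):
    """Negative of production, purchase, and holding costs.
    The cost values are hypothetical; revenue is not modelled here."""
    new_inv, reused_inv = state
    produce, purchase = action
    return -(c_prod * produce + c_buy * purchase
             + h_new * new_inv + h_reused * reused_inv)
```

With these bounds there are 12 states and 6 actions, which shows how quickly the joint state space grows once the distributor's four-variable state is added.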
Distributor's MDP model:
State set: The state of the distributor consists of four variables: the distributor's current inventory of new power batteries, its inventory of reused power batteries, its inventory of used power batteries, and the consumer demand for new power batteries. All four are assumed to be discrete and bounded above, so the distributor's state set is the set of all four-component combinations in which each component ranges over the integers from 0 to its upper bound.
Action set: The distributor's action consists of four variables: the number of new power batteries and the number of reused power batteries that the distributor will order from the manufacturer, the number of used power batteries that the distributor will sell to the recycler, and the number of used power batteries that the distributor will buy from the recycler in the next period. All four quantities are assumed to be discrete and bounded above.
Transfer probability: The distributor's transfer probability is built from three probabilities: the probability that the manufacturer supplies a given quantity of new or reused power batteries to the distributor in the next period, the probability that the consumer purchases a given quantity of new power batteries from, or delivers a given quantity of used power batteries to, the distributor in the next period, and the probability that the recycler acquires a given quantity of used power batteries from, or provides a given quantity to, the distributor in the next period. The transfer probability is zero for next-period states that are inconsistent with these quantities. In this paper, it is assumed that all these probabilities are known and obey some known probability distribution, such as the Poisson distribution or the normal distribution.
Reward function: The distributor's reward function depends on seven parameters: the unit price at which the distributor orders new power batteries from the manufacturer, the unit price at which it orders reused power batteries from the manufacturer, the unit price at which it sells used power batteries to the recycler, the unit price at which it buys used power batteries from the recycler, and the unit inventory costs of holding new, reused, and used power batteries. This paper assumes that these parameters are known and constant.
Consumer's MDP model:
State set: The consumer's state consists of two variables, the remaining capacity of the power battery currently used by the consumer and the consumer's demand for a new power battery. Both are assumed to be discrete and bounded above, so the consumer's state set is the set of all pairs in which each component ranges over the integers from 0 to its upper bound.
Action set: The consumer's action consists of two variables, the number of new power batteries the consumer will purchase from the distributor and the number of used power batteries the consumer will deliver to the distributor or recycler in the next period. Both quantities are assumed to be discrete and bounded above, so the consumer's action set is the set of all pairs in which each component ranges over the integers from 0 to its upper bound.
Transfer probability: The consumer's transfer probability is built from two probabilities: the probability that the number of power batteries used by the consumer in the next period takes a given value, and the probability that the consumer's demand for new power batteries in the next period takes a given value. It is zero for next-period states that are inconsistent with these values. In this paper, it is assumed that both probabilities are known and obey some known probability distribution, such as the Poisson distribution or the normal distribution.
Reward function: The consumer's reward function depends on four parameters: the unit value of the power battery used by the consumer, the unit price at which the consumer purchases a new power battery from the distributor, the unit price at which the consumer delivers a used power battery to the distributor or recycler, and the unit cost of the consumer's demand for a new power battery. This paper assumes that these parameters are known and constant.
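Because the state variables are bounded while a Poisson distribution has unbounded support, some truncation is needed before such a distribution can feed a transfer probability. The sketch below shows one possible way to do this. The function names, the choice to lump the tail mass onto the upper bound, and the inventory dynamic `min(cap, max(0, inventory + order - demand))` are all assumptions made for illustration and are not the paper's formulas.

```python
import math


def truncated_poisson_pmf(mean: float, upper: int) -> list:
    """Poisson(mean) probabilities on 0..upper, with all mass above `upper`
    lumped onto `upper` so the probabilities sum to 1 on the bounded support."""
    pmf = [math.exp(-mean) * mean ** k / math.factorial(k) for k in range(upper)]
    pmf.append(1.0 - sum(pmf))  # probability that demand is at least `upper`
    return pmf


def next_inventory_distribution(inventory: int, order: int, cap: int,
                                demand_pmf: list) -> dict:
    """Distribution of next-period inventory under the assumed dynamic
    next = min(cap, max(0, inventory + order - demand))."""
    dist = {}
    for demand, prob in enumerate(demand_pmf):
        nxt = min(cap, max(0, inventory + order - demand))
        dist[nxt] = dist.get(nxt, 0.0) + prob
    return dist
```

For example, `next_inventory_distribution(2, 1, 4, truncated_poisson_pmf(2.0, 5))` returns one row of a transfer probability table: a mapping from each reachable next-period inventory level to its probability.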
Recycler's MDP model:
State set: The state of the recycler consists of one variable, the current inventory of used power batteries held by the recycler. This inventory is assumed to be discrete and bounded above, so the recycler's state set consists of the integers from 0 to that upper bound.
Action set: The recycler's action consists of two variables, the number of used power batteries the recycler wants to sell to the reuser and the number of used power batteries the recycler wants to buy from the distributor or consumer in the next period. Both quantities are assumed to be discrete and bounded above.
Transfer probability: The recycler's transfer probability is built from two probabilities: the probability that the number of used power batteries purchased by the reuser from the recycler in the next period takes a given value, and the probability that the number of used power batteries supplied by the distributor or consumer to the recycler in the next period takes a given value. In this paper, it is assumed that both probabilities are known and obey some known probability distribution, such as the Poisson distribution or the normal distribution.
Reward function: The recycler's reward function depends on three parameters: the unit price at which the recycler sells used power batteries to the reuser, the unit price at which the recycler buys used power batteries from the distributor or the consumer, and the unit inventory cost of holding used power batteries. This paper assumes that these parameters are known and constant.
Reuser's MDP model:
State set: The state of the reuser consists of one variable, $w$, the current inventory of used power batteries held by the reuser. This inventory is assumed to be discrete and bounded above, so the reuser's state set consists of the integers from 0 to that upper bound.
Action set: The action of the reuser consists of two variables: $u$, the number of reused power batteries that the reuser wants to sell to the manufacturer or distributor in the next period, and $t$, the number of used power batteries that the reuser wants to buy from the recycler. Both quantities are assumed to be discrete and bounded above.
Transfer probability: The reuser's transfer probability is built from two probabilities: the probability that the manufacturer or distributor orders a given quantity of reused power batteries from the reuser in the next period, and the probability that the recycler provides a given quantity of used power batteries to the reuser in the next period. In this paper, it is assumed that both probabilities are known and obey some known probability distribution, such as the Poisson distribution or the normal distribution.
Reward function: The reward function of the reuser is given by the following equation:

$$R(w, u, t) = c_u u - c_t t - c_w w$$

where $c_u$ denotes the unit price at which the reuser sells a reused power battery to the manufacturer or distributor, $c_t$ denotes the unit price at which the reuser buys a used power battery from the recycler, and $c_w$ denotes the unit inventory cost at which the reuser holds a used power battery. This paper assumes that these parameters are known and constant.
So far, this paper has modeled the Markov decision process of each decision maker in the closed-loop supply chain of power batteries.
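The reuser's one-period reward described above (revenue from reused batteries sold, minus the cost of used batteries bought, minus the cost of holding inventory) can be written as a short function. The parameter names `c_u`, `c_t`, and `c_w` and the numbers in the usage note are hypothetical labels chosen for this sketch.

```python
def reuser_reward(w: int, u: int, t: int,
                  c_u: float, c_t: float, c_w: float) -> float:
    """One-period reward of the reuser.

    w: current inventory of used batteries (state)
    u: reused batteries sold to the manufacturer or distributor (action)
    t: used batteries bought from the recycler (action)
    c_u, c_t, c_w: unit selling price, unit buying price, unit holding cost
                   (hypothetical parameter names)
    """
    return c_u * u - c_t * t - c_w * w
```

For instance, with an inventory of 4, sales of 3 at a unit price of 10, purchases of 2 at a unit price of 4, and a unit holding cost of 0.5, the reward is 30 - 8 - 2 = 20.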

Methods for Solving the Model
Since there are multiple decision makers in the closed-loop supply chain of power batteries and their decisions affect each other, this paper adopts the framework of the Multi-Agent Markov Decision Process (MAMDP), which treats each decision maker in the supply chain as an agent. Each agent has its own state set, action set, transfer probability, and reward function, but its transfer probability and reward function are affected by the other agents' actions. Therefore, this paper needs to consider the game and coordination problem in the supply chain, i.e., how to let each agent maximize its own interests while also maximizing the overall efficiency of the supply chain.
In this paper, the following two methods are used to solve the MAMDP model: Dynamic-programming-based approach: This approach is model-based, i.e., it requires the transfer probability and reward function of each agent to be known. The method has two steps. The first step uses the concept of Nash Equilibrium (NE) to solve for the optimal combination of actions in each state, i.e., a combination in which no agent can improve its long-term expected reward by unilaterally changing its action. The second step uses algorithms such as value iteration or policy iteration to solve for the optimal value function in each state, i.e., the maximum long-term expected reward that can be obtained when decisions follow the optimal combination of actions in each state. The advantage of this method is that it is guaranteed to find the global optimal solution, but the disadvantages are its high computational complexity and its need for complete and symmetric information.
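The equilibrium condition in the first step can be illustrated for a single state and two agents. The sketch below checks every joint action for profitable unilateral deviations and returns the pure-strategy Nash equilibria. It is a simplified illustration, not the paper's procedure: the payoff table stands in for the agents' long-term expected rewards in one state, and the agent roles, action labels, and numbers are hypothetical.

```python
from itertools import product


def pure_nash_equilibria(payoffs):
    """payoffs[(a1, a2)] = (reward to agent 1, reward to agent 2).
    Returns all joint actions from which neither agent gains by deviating alone."""
    actions_1 = sorted({a1 for a1, _ in payoffs})
    actions_2 = sorted({a2 for _, a2 in payoffs})
    equilibria = []
    for a1, a2 in product(actions_1, actions_2):
        r1, r2 = payoffs[(a1, a2)]
        # Agent 1 cannot do better by switching while agent 2 keeps a2.
        best_1 = all(payoffs[(d, a2)][0] <= r1 for d in actions_1)
        # Agent 2 cannot do better by switching while agent 1 keeps a1.
        best_2 = all(payoffs[(a1, d)][1] <= r2 for d in actions_2)
        if best_1 and best_2:
            equilibria.append((a1, a2))
    return equilibria


# Hypothetical one-state game: the recycler picks a buy-back price level,
# the reuser picks an order size.
toy_payoffs = {
    ("low", "small"): (3.0, 2.0),
    ("low", "large"): (1.0, 3.0),
    ("high", "small"): (2.0, 1.0),
    ("high", "large"): (4.0, 4.0),
}
```

In this toy table the only joint action that survives the deviation check is `("high", "large")`. A pure-strategy equilibrium need not exist in general, in which case mixed strategies would have to be considered.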
Reinforcement-learning-based approach: This approach is based on data, i.e., instead of knowing the transfer probability and reward function of each agent, the optimal policy is learned by interacting with the environment to generate data.The approach has two steps: the first step is to use the concept of the Multi-Armed Bandit (MAB) to design an exploration-exploitation algorithm that allows each agent to balance the trade-off between exploring new actions and exploiting known ones during the learning process.The second step is to use algorithms such as Monte Carlo or temporal differencing to estimate the value function for each state or state-action pair and determine the optimal policy based on the estimates.The advantage of this method is that it can adapt to dynamically changing environments and does not require complete and symmetric information, but the disadvantage is that it may fall into local optimal solutions and has a slow convergence rate.
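The first step of the dynamic-programming-based approach can be illustrated on a single state with two agents. The following is a minimal sketch, not the implementation used in this paper: the function name, the two-agent restriction, and the payoff numbers are illustrative assumptions, and the payoff matrices stand in for the long-term expected rewards of each joint action in one state. It enumerates the joint actions and keeps those from which neither agent can gain by deviating unilaterally.

```python
import numpy as np

def pure_nash_equilibria(payoff_a, payoff_b):
    """Return all pure-strategy Nash equilibria (i, j) of a two-agent stage game.

    payoff_a[i, j] and payoff_b[i, j] are the payoffs of agents A and B when
    A plays row action i and B plays column action j.
    """
    payoff_a = np.asarray(payoff_a, dtype=float)
    payoff_b = np.asarray(payoff_b, dtype=float)
    equilibria = []
    for i in range(payoff_a.shape[0]):
        for j in range(payoff_a.shape[1]):
            # A cannot gain by switching rows while B keeps column j.
            a_is_best = payoff_a[i, j] >= payoff_a[:, j].max()
            # B cannot gain by switching columns while A keeps row i.
            b_is_best = payoff_b[i, j] >= payoff_b[i, :].max()
            if a_is_best and b_is_best:
                equilibria.append((i, j))
    return equilibria

# Hypothetical payoffs: rows are manufacturer actions, columns are recycler actions.
manufacturer_payoff = [[4.0, 1.0], [3.0, 2.0]]
recycler_payoff = [[3.0, 2.0], [1.0, 4.0]]
equilibria = pure_nash_equilibria(manufacturer_payoff, recycler_payoff)
print(equilibria)
```

In this example two joint actions qualify, which shows that a state may admit several pure equilibria (or none), so a selection rule or mixed strategies may be needed in practice.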

Case Studies
This section uses a specific case study to validate the effectiveness and practicality of the Markov decision process model of the closed-loop supply chain of power batteries established in this paper, and to compare the performance of the dynamic-programming-based and reinforcement-learning-based solution methods under different parameters and scenarios.

Case Selection and Data Collection
This paper takes a new energy vehicle manufacturer in China as the subject of the case study. The manufacturer produces a variety of new energy vehicles using lithium-ion power batteries and has established a closed-loop supply chain for power batteries involving multiple distributors, consumers, recyclers, and reusers. Based on the data provided by this manufacturer and the related literature, each parameter in the model is set and estimated as shown in Table 1. To simplify the model, the following assumptions are made:
Only one manufacturer, one distributor, one consumer, one recycler, and one reuser are considered, i.e., competition and diversity in the supply chain are ignored.
Only one variety of power battery is considered, i.e., product differences and diversity in the supply chain are ignored.
Only one decision period is considered, i.e., temporal differences and diversity in the supply chain are ignored.
All probability distributions are assumed to be known and to follow the Poisson distribution, i.e., more complex forms of uncertainty in the supply chain are ignored.
These assumptions are made to facilitate the solution and analysis of the model; the modeling framework itself can be extended beyond them. Future research could relax these assumptions to increase the adaptability and usefulness of the model.
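To make the Poisson assumption concrete, the sketch below builds a transition probability matrix for a single inventory state variable. It is an illustration under stated assumptions rather than the model of this paper: the state is taken to be a recycler's stock of used batteries, and the capacity, arrival rate, and per-period processing quantity are hypothetical values, not entries of Table 1.

```python
import numpy as np
from scipy.stats import poisson

def inventory_transition_matrix(capacity, arrival_rate, processed_per_period):
    """Transition matrix for an inventory level under Poisson arrivals.

    State s is the number of used batteries in stock (0..capacity). Each
    period, up to `processed_per_period` units leave the stock and a
    Poisson(arrival_rate) number of units arrive; stock is capped at capacity.
    """
    n_states = capacity + 1
    matrix = np.zeros((n_states, n_states))
    for s in range(n_states):
        after_processing = max(s - processed_per_period, 0)
        room = capacity - after_processing
        # Arrivals that still fit below the capacity limit.
        for arrivals in range(room):
            matrix[s, after_processing + arrivals] = poisson.pmf(arrivals, arrival_rate)
        # Any arrival count >= room fills the stock to capacity.
        matrix[s, capacity] = poisson.sf(room - 1, arrival_rate)
    return matrix

# Hypothetical parameters for illustration only.
P = inventory_transition_matrix(capacity=5, arrival_rate=2.0, processed_per_period=1)
print(P.round(3))
```

Each row of the resulting matrix is a probability distribution over next-period stock levels, which is the form of input that model-based methods such as value iteration require.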

Analysis Using Markov Decision Process Models
In this paper, the dynamic-programming-based and reinforcement-learning-based solution methods are implemented in the Python programming language with related mathematical and statistical libraries, and the models are simulated and analyzed. The following metrics are used to evaluate the performance and effectiveness of the models and methods:
Profit for each party in the supply chain: The total revenue minus the total cost of each decision maker over a period of time.
Overall supply chain efficiency: The average of the sum of the profits of the supply chain parties over a period of time.
Overall supply chain utility: The weighted average of the sum of the profits of the supply chain parties over a period of time, where the weights reflect the importance that the supply chain parties place on profits and can be set by the decision makers themselves or determined by the coordination mechanism.
Convergence of the model: Whether the model reaches a stable state or policy within a limited number of iterations or interactions.
Model robustness: The ability of the model to adapt to different parameters and scenarios, such as demand fluctuations, price changes, and technological advances.
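The first three metrics can be computed directly from per-period revenue and cost records. The sketch below shows one possible reading of the definitions above; the array layout (periods by parties), the function name, the normalization of the weights, and the numbers are assumptions made for illustration.

```python
import numpy as np

def supply_chain_metrics(revenue, cost, weights):
    """Compute party profits, overall efficiency, and overall utility.

    revenue, cost: arrays of shape (periods, parties).
    weights: importance weights of shape (parties,), normalized to sum to 1.
    """
    revenue = np.asarray(revenue, dtype=float)
    cost = np.asarray(cost, dtype=float)
    weights = np.asarray(weights, dtype=float)
    weights = weights / weights.sum()
    profit = revenue - cost
    party_profit = profit.sum(axis=0)          # total profit of each party
    efficiency = profit.sum(axis=1).mean()     # mean per-period sum of profits
    utility = (profit @ weights).mean()        # mean per-period weighted profit
    return party_profit, efficiency, utility

# Hypothetical two-period, two-party data.
party_profit, efficiency, utility = supply_chain_metrics(
    revenue=[[10.0, 6.0], [12.0, 8.0]],
    cost=[[4.0, 3.0], [5.0, 5.0]],
    weights=[0.5, 0.5],
)
print(party_profit, efficiency, utility)
```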
In this paper, we first compare the performance of the dynamic-programming-based and reinforcement-learning-based methods under different discount factors. The dynamic-programming-based approach is guaranteed to find the global optimal solution, i.e., to maximize the overall efficiency and utility of the supply chain under any discount factor. In contrast, the reinforcement-learning-based approach may become trapped in a local optimum, i.e., the overall supply chain efficiency and utility are not maximized under certain discount factors. In addition, the dynamic-programming-based approach requires fewer iterations to converge, while the reinforcement-learning-based approach requires more interactions to converge. Therefore, the dynamic-programming-based approach outperforms the reinforcement-learning-based approach when the information is complete and symmetric.

Environment and Library Configuration
This study was implemented in Python 3.8 with the following key libraries:
numpy: For creating and manipulating arrays and matrices and for efficient linear algebra operations.
scipy: Provides numerical computation tools, including linear programming solvers, for numerical analysis and optimization in the models.
gym: An open source toolkit for reinforcement learning environments that provides multiple test problems and algorithmic interfaces for easy simulation and evaluation.
stable_baselines3: A library of reinforcement learning algorithms built on PyTorch, used for implementing and testing different reinforcement learning strategies.
matplotlib: For data visualization, plotting graphs, and presenting results.

Implementation of the Dynamic Programming Algorithm
The implementation of the dynamic programming algorithm in this study consists of two main methods, value iteration and policy iteration:
Value iteration (Algorithm 1): The value function of each state is updated iteratively until it converges to the optimal value function; the optimal policy is then derived from that value function. Algorithm 1 is the Python code for the value iteration algorithm; it imports numpy as np and defines value_iteration(transition_probs, rewards, gamma = 0.9, threshold = 0.01).
Policy iteration (Algorithm 2): A policy is randomly initialized, and the optimal policy is then found by alternating policy evaluation and policy improvement, from which the optimal value function is derived. Algorithm 2 is the Python code for the policy iteration algorithm; its evaluation step is defined as policy_evaluation(policy, transition_probs, rewards, gamma = 0.9, threshold = 0.01).
Both methods are implemented with Python loops and conditional statements, which ensure that the algorithm stops iterating once the convergence condition is reached.
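For reference, the two procedures can be sketched as follows. Only the function signatures value_iteration and policy_evaluation come from Algorithms 1 and 2; the function bodies, the array layout (transition_probs[s, a, s'] and rewards[s, a]), the policy_iteration wrapper with its max_rounds cap, the all-zeros initial policy (used instead of a random one for reproducibility), and the two-state example are assumptions of this sketch.

```python
import numpy as np

def value_iteration(transition_probs, rewards, gamma=0.9, threshold=0.01):
    """Return (values, greedy policy) for P[s, a, s'] and R[s, a]."""
    values = np.zeros(rewards.shape[0])
    while True:
        # Bellman optimality backup for every state-action pair.
        q_values = rewards + gamma * transition_probs @ values
        new_values = q_values.max(axis=1)
        delta = np.abs(new_values - values).max()
        values = new_values
        if delta < threshold:
            return values, q_values.argmax(axis=1)

def policy_evaluation(policy, transition_probs, rewards, gamma=0.9, threshold=0.01):
    """Iteratively evaluate a deterministic policy (one action index per state)."""
    states = np.arange(rewards.shape[0])
    p_pi = transition_probs[states, policy]   # shape (S, S)
    r_pi = rewards[states, policy]            # shape (S,)
    values = np.zeros(rewards.shape[0])
    while True:
        new_values = r_pi + gamma * p_pi @ values
        delta = np.abs(new_values - values).max()
        values = new_values
        if delta < threshold:
            return values

def policy_iteration(transition_probs, rewards, gamma=0.9, threshold=0.01, max_rounds=100):
    """Alternate policy evaluation and greedy improvement until the policy is stable."""
    policy = np.zeros(rewards.shape[0], dtype=int)
    for _ in range(max_rounds):
        values = policy_evaluation(policy, transition_probs, rewards, gamma, threshold)
        new_policy = (rewards + gamma * transition_probs @ values).argmax(axis=1)
        if np.array_equal(new_policy, policy):
            break
        policy = new_policy
    return values, policy

# Hypothetical two-state, two-action MDP.
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.7, 0.3], [0.1, 0.9]]])
R = np.array([[1.0, 0.0],
              [2.0, 3.0]])
vi_values, vi_policy = value_iteration(P, R, threshold=1e-6)
pi_values, pi_policy = policy_iteration(P, R, threshold=1e-6)
print(vi_policy, pi_policy, vi_values.round(3))
```

On this small example both procedures select the same policy and agree on the value function up to the convergence threshold.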

Implementation of Reinforcement Learning Algorithm
The reinforcement learning algorithms applied in this study include episode-based and step-based algorithms:
Monte Carlo algorithm: This is an episode-based algorithm that estimates the value of a state or state-action pair by sampling complete episodes and improves the estimate step by step.
Temporal difference algorithms, e.g., SARSA and Q-learning: These algorithms update state or state-action value estimates online from single-step reward and state information (Algorithm 3). In Algorithm 3, the agent advances the environment with obs, rewards, dones, info = env.step(action) and displays it with env.render().
These algorithms were implemented in Python and tested through the MDP environment interface provided by OpenAI Gym. In addition, the PPO and DQN algorithms from the stable_baselines3 library were used for automated training.
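The temporal difference update can be illustrated with tabular Q-learning and an epsilon-greedy exploration rule. The sketch below is self-contained and deliberately does not use the Gym interface: it samples transitions from a known toy model in place of env.step, and the function name, hyperparameters, and two-state example are assumptions made for illustration.

```python
import numpy as np

def q_learning(transition_probs, rewards, episodes=500, steps=50,
               alpha=0.1, gamma=0.9, epsilon=0.1, seed=0):
    """Tabular Q-learning on an MDP simulated from P[s, a, s'] and R[s, a]."""
    rng = np.random.default_rng(seed)
    n_states, n_actions = rewards.shape
    q = np.zeros((n_states, n_actions))
    for _ in range(episodes):
        state = int(rng.integers(n_states))
        for _ in range(steps):
            # Epsilon-greedy: explore with probability epsilon, else exploit.
            if rng.random() < epsilon:
                action = int(rng.integers(n_actions))
            else:
                action = int(q[state].argmax())
            next_state = int(rng.choice(n_states, p=transition_probs[state, action]))
            reward = rewards[state, action]
            # One-step TD target uses the best estimated next-state value.
            td_target = reward + gamma * q[next_state].max()
            q[state, action] += alpha * (td_target - q[state, action])
            state = next_state
    return q

# Hypothetical toy MDP: action 0 tends to stay, action 1 tends to switch state;
# only staying in state 1 is rewarded.
P = np.array([[[0.9, 0.1], [0.1, 0.9]],
              [[0.1, 0.9], [0.9, 0.1]]])
R = np.array([[0.0, 0.0],
              [1.0, 0.0]])
q_table = q_learning(P, R)
greedy_policy = q_table.argmax(axis=1)
print(q_table.round(2), greedy_policy)
```

SARSA differs only in the target, which uses the value of the action actually taken in the next state rather than the maximum.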

Analysis and Validation of Results
The results are analyzed mainly by calculating the profit metrics of each party in the supply chain. Learning curves were also drawn to show the convergence of the different algorithms, and the impact of the discount factor on the performance of the algorithms was evaluated through parameter sensitivity analysis. These steps were implemented in Python.
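A discount-factor sensitivity analysis of the kind described here can be sketched as a loop over candidate discount factors, solving the model for each one and recording the resulting values and the number of iterations needed. The helper names, the toy MDP, and the candidate discount factors below are illustrative assumptions, not the experimental setup of this paper.

```python
import numpy as np

def solve_mdp(transition_probs, rewards, gamma, threshold=1e-6):
    """Value iteration that also reports how many sweeps were needed."""
    values = np.zeros(rewards.shape[0])
    iterations = 0
    while True:
        q_values = rewards + gamma * transition_probs @ values
        new_values = q_values.max(axis=1)
        iterations += 1
        converged = np.abs(new_values - values).max() < threshold
        values = new_values
        if converged:
            return values, q_values.argmax(axis=1), iterations

def discount_sensitivity(transition_probs, rewards, gammas):
    """Solve the same MDP for several discount factors and summarize the results."""
    results = {}
    for gamma in gammas:
        values, policy, iterations = solve_mdp(transition_probs, rewards, gamma)
        results[gamma] = {
            "mean_value": float(values.mean()),
            "policy": policy.tolist(),
            "iterations": iterations,
        }
    return results

# Hypothetical two-state, two-action MDP.
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.7, 0.3], [0.1, 0.9]]])
R = np.array([[1.0, 0.0],
              [2.0, 3.0]])
sensitivity = discount_sensitivity(P, R, gammas=[0.5, 0.9, 0.99])
for gamma, summary in sensitivity.items():
    print(gamma, summary)
```

Larger discount factors weight future rewards more heavily, so the value estimates grow and more sweeps are needed before the stopping threshold is met.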

Decision-Making Mechanisms Based on Markov Decision Process Models
The decision-making mechanism based on the Markov decision process model refers to the supply chain parties dynamically adjusting their actions to maximize their own interests based on the current state and future expectations. The decision-making mechanism has the following characteristics:
1. The decision-making mechanism is decentralized, i.e., each decision maker can make decisions independently without the need to communicate or consult with other decision makers.
2. The decision-making mechanism is adaptive, i.e., each decision maker can continuously update its state and policy in response to changes and feedback from the environment in order to adapt to uncertainty and dynamics.
3. The decision-making mechanism is intelligent, i.e., each decision maker can learn and optimize to find the optimal or near-optimal action that improves its long-term expected reward.
The steps for the implementation of this decision-making mechanism are set out below:
Step 1: Initialization. Each decision maker initializes its state, action, value function, and policy. The state and action can be set according to the actual situation, the value function can be initialized randomly or to zero, and the policy can be initialized randomly or uniformly.
Step 2: Observation. Each decision maker observes the current state and the actions of the other decision makers and calculates its own immediate reward based on the transition probabilities and reward function.
Step 3: Learning. Each decision maker updates its value function and policy based on the observed data. Algorithms such as value iteration or policy iteration can be used with the dynamic-programming-based method; algorithms such as Monte Carlo or temporal difference learning can be used with the reinforcement-learning-based method.
Step 4: Execution. Each decision maker selects and executes an action based on the current state and the updated policy.
Step 5: Repetition. Each decision maker repeats steps 2 through 4 until a termination condition is met, such as convergence or reaching the maximum number of iterations or interactions.
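The five steps above can be sketched as a loop of independent learners. This is a simplified illustration, not the mechanism as implemented in this paper: it assumes a single state (a repeated stage game), two agents, a running-average value update, and hypothetical payoff matrices in which each party is individually tempted to choose the low-effort action.

```python
import numpy as np

def independent_learners(payoffs, rounds=2000, alpha=0.1, epsilon=0.1, seed=0):
    """Run decentralized learners; payoffs[k][a0, a1] is agent k's reward."""
    rng = np.random.default_rng(seed)
    n_agents = len(payoffs)
    # Step 1: initialization of value estimates and initial actions.
    q = [np.zeros(payoffs[k].shape[k]) for k in range(n_agents)]
    actions = [int(rng.integers(len(q_k))) for q_k in q]
    # Step 5: repetition for a fixed number of interactions.
    for _ in range(rounds):
        joint = tuple(actions)
        for k in range(n_agents):
            # Step 2: observation of the agent's own immediate reward.
            reward = payoffs[k][joint]
            # Step 3: learning by moving the estimate toward the reward.
            q[k][actions[k]] += alpha * (reward - q[k][actions[k]])
        # Step 4: execution of an epsilon-greedy action by each agent.
        actions = [
            int(rng.integers(len(q_k))) if rng.random() < epsilon else int(q_k.argmax())
            for q_k in q
        ]
    return q

# Hypothetical payoffs: action 0 = high recycling effort, action 1 = low effort.
payoff_a = np.array([[3.0, 0.0], [5.0, 1.0]])
payoff_b = np.array([[3.0, 5.0], [0.0, 1.0]])
q_estimates = independent_learners([payoff_a, payoff_b])
print([q_k.round(2) for q_k in q_estimates])
```

With payoffs of this shape, self-interested learners tend to favor the low-effort action even though joint high effort yields a larger total profit, which is the kind of efficiency loss that motivates a coordination mechanism.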
This decision-making mechanism allows supply chain parties to maximize their own interests without a central coordinator or information sharing. However, it has some drawbacks: it may reduce the overall efficiency and utility of the supply chain, create an imbalance of interests among the supply chain parties, and leave the parties without trust in one another. Therefore, in the next section of this paper, a coordination mechanism is designed to ameliorate these problems.

Coordination Mechanisms Based on Markov Decision Process Models
The coordination mechanism based on the Markov decision process model refers to rational allocation and incentive arrangements that lead all parties in the supply chain to consider the overall efficiency and utility of the supply chain while pursuing their own interests. This coordination mechanism has the following characteristics:
1. The coordination mechanism is centralized, i.e., a central coordinator is needed to design and implement the coordination mechanism, as well as to communicate or consult with all parties in the supply chain.
2. The coordination mechanism is contractual in nature, i.e., it requires a contract or agreement to bind the supply chain parties to their behaviors and responsibilities, as well as to specify the benefits and risks for each party in the supply chain.
3. The coordination mechanism is incentive-based, i.e., it needs to provide incentives or penalties to motivate supply chain parties to comply with the contract or agreement, as well as to promote the overall efficiency and utility of the supply chain.
The specific steps for the implementation of this coordination mechanism are set out below:
Step 1: Determine the objectives. The central coordinator determines the objective function for the overall efficiency and utility of the supply chain and the degree of importance, i.e., the weights, that each party in the supply chain places on profit. The objective function can be linear or non-linear, and the weights can be fixed or variable.
Step 2: Design contracts. The central coordinator designs contracts or agreements that specify the actions each supply chain party should take in each state, as well as the rewards or penalties to be assigned based on the actions and outcomes. The contract or agreement can be complete or incomplete, i.e., it may or may not cover all possible states and actions.
Step 3: Enforce the contract. The central coordinator monitors whether supply chain parties are making decisions in accordance with the contract or agreement and enforces rewards or penalties based on decisions and results. If supply chain parties are found to have violated the contract or agreement, the central coordinator can take appropriate measures, such as termination of the contract, claims, and lawsuits.
Step 4: Update contracts. The central coordinator updates the objective function, weights, and contracts or agreements in response to changes and feedback from the environment in order to adapt to uncertainty and dynamics and to improve the overall efficiency and utility of the supply chain.
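Steps 1 and 2 can be illustrated for a single state with two parties. In the sketch below, which is an assumption-laden illustration rather than the contract design of this paper, the coordinator selects the joint action that maximizes a linear weighted objective and then computes, for each party, the largest gain available from deviating unilaterally; a penalty larger than that gain removes the incentive to break the contract. The function name, weights, and payoffs are hypothetical.

```python
import numpy as np

def design_contract(payoff_a, payoff_b, weights):
    """Return the contracted joint action and each party's deviation gain."""
    payoff_a = np.asarray(payoff_a, dtype=float)
    payoff_b = np.asarray(payoff_b, dtype=float)
    # Step 1: linear weighted objective over the parties' payoffs.
    objective = weights[0] * payoff_a + weights[1] * payoff_b
    # Step 2: the contract prescribes the joint action with the best objective.
    i, j = np.unravel_index(objective.argmax(), objective.shape)
    # Largest payoff improvement each party could obtain by deviating alone.
    gain_a = payoff_a[:, j].max() - payoff_a[i, j]
    gain_b = payoff_b[i, :].max() - payoff_b[i, j]
    return (int(i), int(j)), (float(gain_a), float(gain_b))

# Hypothetical payoffs: action 0 = high recycling effort, action 1 = low effort.
payoff_a = [[3.0, 0.0], [5.0, 1.0]]
payoff_b = [[3.0, 5.0], [0.0, 1.0]]
contract_action, deviation_gains = design_contract(payoff_a, payoff_b, weights=(0.5, 0.5))
print(contract_action, deviation_gains)
```

Here the contract prescribes high effort for both parties, and any penalty above the reported deviation gain makes compliance each party's best response.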
This coordination mechanism enables supply chain parties to maximize the overall efficiency and utility of the supply chain with a central coordinator and information sharing. However, it faces some challenges, such as how to determine reasonable objective functions, weights, and contracts or agreements, and how to ensure the truthfulness and good faith of the supply chain parties. Therefore, in the next section of this paper, the effectiveness of this coordination mechanism is evaluated and some suggestions for improvement are made.

Assessment of the Effectiveness of Coordination Mechanisms and Recommendations for Improvement
This paper evaluates the effectiveness of the coordination mechanism based on the Markov decision process model through simulation experiments and makes some suggestions for improvement. The following indicators are used to evaluate the effectiveness of the coordination mechanism:
1. Rate of increase in the overall efficiency of the supply chain: The percentage increase in the overall efficiency of the supply chain under the coordination mechanism relative to its level under the decision-making mechanism alone.
2. Rate of increase in the overall utility of the supply chain: The percentage increase in the overall utility of the supply chain under the coordination mechanism relative to its level under the decision-making mechanism alone.
3. Equity in profit distribution among supply chain parties: Whether the distribution of profits among supply chain parties under the coordination mechanism is in line with their contributions and expectations, and whether there is any imbalance or exploitation in profit distribution.
4. Contract compliance rate of supply chain parties: Whether supply chain parties make decisions in accordance with the contract or agreement under the coordination mechanism, and whether there is any violation of the contract or agreement.
In this paper, we design different coordination mechanisms based on different objective functions, weights, and contracts or agreements, and compare their effects under different parameters and scenarios.
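The indicators above can be computed from simulation output with a few short functions. The sketch below is one possible operationalization: the function names and sample numbers are hypothetical, and the equity measure in particular (the largest gap between a party's profit share and its contribution share) is an assumption of this sketch, since the indicator is described only qualitatively.

```python
def improvement_rate(before, after):
    """Percentage increase of a metric relative to its baseline value."""
    return (after - before) / abs(before) * 100.0

def compliance_rate(contract_actions, taken_actions):
    """Fraction of decisions that match the contracted action."""
    matches = sum(c == t for c, t in zip(contract_actions, taken_actions))
    return matches / len(contract_actions)

def profit_share_gap(profits, contributions):
    """Largest gap between profit share and contribution share (0 = proportional)."""
    total_profit = sum(profits)
    total_contribution = sum(contributions)
    return max(
        abs(p / total_profit - c / total_contribution)
        for p, c in zip(profits, contributions)
    )

# Hypothetical simulation outputs.
efficiency_gain = improvement_rate(before=9.5, after=11.4)
compliance = compliance_rate([0, 0, 1, 0], [0, 1, 1, 0])
equity_gap = profit_share_gap(profits=[40.0, 35.0, 25.0], contributions=[50.0, 30.0, 20.0])
print(efficiency_gain, compliance, equity_gap)
```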
After the coordination mechanism is applied, the overall efficiency and utility of the supply chain improve to different degrees, but some problems remain, such as unfair profit distribution and a low contract compliance rate. Therefore, this paper puts forward the following suggestions for improvement:
1. When determining the objective function, the multiple objectives of the supply chain parties, such as profit, cost, service, and environment, should be considered, weighed, and balanced according to the actual situation and priorities.
2. When determining the weights, the benefit preferences and risk preferences of all parties in the supply chain should be taken into account, and the weights should be allocated and adjusted according to the actual situation and the principle of fairness.
3. When designing contracts or agreements, the incomplete and asymmetric information of the supply chain parties should be taken into account, and the contracts should be designed and optimized according to the actual situation and incentive principles.
4. When executing the contract or agreement, the truthfulness and good faith of the parties in the supply chain should be taken into account, and execution should be monitored and enforced in accordance with the actual situation and the principle of constraint.

Limitations of the Study and Future Prospects
In constructing a decision-making and coordination mechanism for the closed-loop supply chain of power batteries, the model in this paper relies on a series of assumptions, which affect the research results. While the simplification of the model helps to focus the analysis on core decision-making issues, it also limits the generalizability of the results. The model limits the supply chain participants to a single manufacturer, distributor, consumer, recycler, and reuser, which excludes competition in the supply chain. The assumption of no competitors may deviate from the complex reality of the business environment: when markets are competitive, supply chain participants must adjust their strategic decisions to remain competitive, a dynamic that the current study does not capture. In addition, the power batteries involved are simplified to a single type. In practice, the supply chain needs to deal with multiple types of batteries, each of which may have different recycling, storage, and transportation needs, which adds complexity. Finally, this study focuses on a single decision cycle and does not fully explore strategy changes under long-term and complex uncertainties.
In response to these limitations, future research could extend the model to include more competitors in the supply chain and different types of batteries, which would improve the utility and broad applicability of the model. Future work could also explore the adaptability and sustainability of supply chain decisions in long-term operations by incorporating multiple decision cycles. Extending the model to a long-term multi-period decision-making environment and combining multiple probability distributions to capture more complex uncertainties would bring the model closer to actual business operations. These improvements are expected to provide finer and more comprehensive strategic recommendations for closed-loop supply chain management of power batteries, helping practitioners to optimize supply chain operations and achieve the dual goals of environmental sustainability and economic efficiency.

Conclusions and Outlook
This study constructs and analyzes the decision-making and coordination mechanism of a closed-loop supply chain for power batteries by applying a Markov Decision Process (MDP) model. The core contribution of the study is a comprehensive modeling framework that can address the uncertainty and dynamics in the supply chain and provide coordination strategies for the multiple decision makers involved. Through a case study, this study validates the effectiveness of the proposed model and demonstrates the application of two solution methods, dynamic programming and reinforcement learning, under different conditions.
The research results show that the constructed MDP model can effectively improve the overall efficiency and effectiveness of the closed-loop supply chain of power batteries, which provides new perspectives and solutions for supply chain management. The model solution results reveal the optimal strategies that should be adopted by all parties in the supply chain under different decision cycles and probability distributions and how to maximize the overall efficiency through the coordination mechanism.
However, the research has some limitations. The current model is based on a series of simplifying assumptions, such as a single participant of each type and a single type of battery, which limits the generalizability of the model. Future research can improve the utility and adaptability of the model by introducing more supply chain participants, multiple battery types, and a long-term multi-period decision-making environment. In addition, exploring different probability distributions and using data-driven approaches to learn about uncertainty in the supply chain will further enhance the accuracy and applicability of the model.
In summary, this study provides an innovative solution to the problem of decision-making and coordination in the closed-loop supply chain of power batteries and points out directions for future research. The research results are not only theoretically significant but also provide a valuable reference for supply chain management in practice. Despite the limitations, the results of this study lay a solid foundation for further exploration in the field of supply chain management.