Machine Learning-Based Model Predictive Control for Collaborative Production Planning Problem with Unknown Information

Abstract: In industrial production planning problems, the accuracy of the accessible market information has the highest priority, as it is directly associated with the reliability of decisions and affects the efficiency and effectiveness of manufacturing. However, during a collaborative task, certain private information regarding the participants might be unknown to the regulator, and the production planning decisions thus become biased or even inaccurate due to the lack of full information. To improve the production performance in this specific case, this paper combines the techniques of machine learning and model predictive control (MPC) to create a comprehensive algorithm with low complexity. We collect the historical data of the decision-making process while the participants make their individual decisions with a certain degree of bias and analyze the collected data using machine learning to estimate the unknown parameter values by solving a regression problem. Based on an accurate estimate, MPC helps the regulator to make optimal decisions, maximizing the overall net profit of a given collaborative task over a future time period. A simulation-based case study is conducted to validate the performance of the proposed algorithm in terms of estimation accuracy. Comparisons with individual and pure MPC decisions are also made to verify its advantages in terms of increasing profit.


Introduction
In modern industry, the participants of a manufacturing task usually work in a collaborative mode, which substantially increases productivity [1]. The individual decisions of the participants contribute to changes in state variables within the industrial process, which in turn affect their own task performance. To address this decision-making process, a dynamic model of the given collaborative task is established to represent the interaction and communication between the participants [2]. Within these collaborative tasks, each participant naturally makes individual decisions to optimize their own task performance if no decision is made by the regulator. However, the optimal decisions for the overall optimal task performance usually conflict with these individual decisions according to the Nash Equilibrium [3]. Furthermore, the participants might even make individual decisions with a certain bias and fail to optimize their individual benefits due to a lack of professional experience. Therefore, an essential responsibility for regulators is to optimize the overall task performance rather than any individual's performance, which creates production planning problems in a collaborative mode.
The class of production planning problems aims at making appropriate decisions for an industrial manufacturing process to optimize the performance of certain tasks and yield various practical benefits, such as maximum net profits [4][5][6], fuzzy multiple objectives [7][8][9], economic production quantity [10] and optimal emission policy [11,12]; see [13][14][15][16][17] for detailed overviews of production planning problems. Most existing production planning methods assume that all information within their problem formulations are known as a priori. Nevertheless, this paper focuses on the production planning problems in collaborative tasks, which contain private information regarding participants that might be unknown to the regulator. In this sense, the regulator cannot obtain the full scope of accurate market information to make optimal decisions for participants, which is an obstacle to making reliable decisions.
To accurately estimate the parameter values of unknown private information, the data-driven technique of machine learning is considered in this paper, which is a collection of approaches for making inferences based upon historical data [18]. This approach extracts useful information from a realistic data set to tune the weighting parameters of a trained model, meaning that this model asymptotically returns desired information to the users [19]. This advantage unlocks intelligent solutions to certain design problems that can hardly be characterized using classical methods due to their complexity and fuzziness [20]. Numerous machine learning procedures have been proposed in different research areas, such as image classification [21,22], data classification and clustering [23][24][25], system identification [26], natural language processing [27], autonomous driving [28] and fault diagnosis [29,30]. Most of these state-of-the-art machine learning approaches are designed for classification tasks with various data types, such as images, audio and video. However, this paper addresses a parameter estimation task by solving a regression problem based on the recorded biased individual decisions, and a "gradient descent"-based machine learning pipeline is proposed to fulfill this task in an efficient manner.
The technique of MPC is employed in this paper to solve the specific production planning problem based on the estimated parameter values. We generally exploit the model and current state information to predict future information so that an appropriate decision is made to optimize the overall task performance over a future time period, which is computationally efficient [31]. Because of this key feature, various MPC methods have been implemented to improve the task performance of broad applications, such as sludge processes [32], micro-grids [33] and power electronics [34]; see [35][36][37] for more details about MPC. Although some existing works in the literature—e.g., [38][39][40]—handle the production planning problem with uncertain variables, their problem formulations were different from that discussed in this paper and did not embed machine learning or MPC methods. Moreover, this machine learning-based MPC method is capable of extracting the target information from a complex but real historical data set and carrying out the estimation procedure in an off-line fashion. It outperforms other on-line learning based MPC methods—e.g., [41][42][43][44]—in terms of its low computational load.
To overcome the aforementioned problems, this paper combines the techniques of machine learning and MPC to aid regulators in computing optimal decisions for the participants, solving the production planning problem with a maximum overall net profit. Note that some preliminary results of this paper have been reported in [45], but significant extensions are proposed in this paper. A detailed discussion is made of technical issues such as Remarks 1-4, an entirely new section (Section 4, Implementation Instructions) is added to help the reader understand the behind-the-scenes parameter tuning mechanism of the algorithm, the case study section employs a newly established model to handle the problem, and the results have been improved compared to those in the published paper because of the new contributions added in this paper. The contributions of this paper are highlighted as follows: • The production planning problem is formulated using a discrete time system, with task performance judged by net profit; • A gradient descent machine learning procedure with an adaptive learning scheme is developed to estimate the unknown parameters of the revenue in Q using historical data via solving a regression problem; • An MPC method uses the estimated values of Q as its user-defined weight factors to predict the optimal decisions to maximize net profit; • A machine learning-based MPC algorithm with low complexity is proposed and validated in a simulation-based case study; • A comparison with individual and pure MPC decisions is performed to show the increase in profit.
The nomenclature is listed as follows: R^n is an n-dimensional Euclidean vector space, R^(m×n) is an m × n real matrix space, a_ij is the element in the ith row and jth column of A, diag(·) is a diagonal matrix of its argument, ρ(·) is the spectral radius of a given matrix, f(·) is a cost function of the performance index, ‖x‖_∞ is the infinity norm of the vector x, N(µ, σ) is the representation of a normal distribution, x̂ is the estimated value of x, L(·) is a loss function, Z is the set of integer numbers, N is the set of non-negative integer numbers, s is the production demand, Q is the weighting parameters of the revenue, R is the weighting parameters of the productivity effort, P is the decision bias parameters of the participants, Π is the benchmark of the unknown parameters.

Problem Formulation
This section introduces the system dynamics and uses certain cost functions to represent the desired task performance. Then, the production planning problem with a collaborative mode and unknown information is formulated, and the effect of unknown parameters is discussed.

System Dynamics
In this paper, the interaction among participants within a given collaborative manufacturing task is modeled by a discrete time system with the following state space representation:

x(k + 1) = Ax(k) + Bu(k). (1)

The system input u(k) ∈ R^n and the system state x(k) ∈ R^n denote the productivity decisions and market sizes of all n participants at sample time index k, respectively. The matrices A and B are of compatible dimensions; i.e., A, B ∈ R^(n×n).
The future state x(k + 1) of the discrete time system (1) is triggered by the current state x(k) and the current input u(k). In other words, the market size of each participant at the next sample time depends on the current market sizes of the participants as well as the current productivity decisions. Therefore, the individual decision of each participant influences the market sizes of the others, and their own market size is also affected by the individual decisions of the others, which gives rise to the system dynamics. If ρ(A) < 1, the discrete time system (1) is stable. The matrix B is diagonal, i.e., b_ij = 0 for i ≠ j, as the decision of an individual participant does not directly affect the market sizes of other participants, and its diagonal elements are non-negative since the productivity clearly has a positive relationship with the market size.
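To illustrate the dynamics (1) and the stability condition ρ(A) < 1, the following minimal Python sketch simulates a hypothetical two-participant instance; the matrix values are illustrative assumptions, not taken from the case study.

```python
import math

# Hypothetical two-participant example of the dynamics x(k+1) = A x(k) + B u(k).
A = [[0.5, 0.1],
     [0.2, 0.4]]
B_diag = [0.8, 0.6]  # B is diagonal with non-negative entries

def spectral_radius_2x2(M):
    # Eigenvalues of a 2x2 matrix from its characteristic polynomial
    tr = M[0][0] + M[1][1]
    det = M[0][0] * M[1][1] - M[0][1] * M[1][0]
    disc = tr * tr - 4.0 * det
    if disc >= 0:
        r = math.sqrt(disc)
        return max(abs((tr + r) / 2), abs((tr - r) / 2))
    return math.sqrt(det)  # complex pair: |lambda| = sqrt(det)

def step(x, u):
    # One step of the market-size dynamics
    return [sum(A[i][j] * x[j] for j in range(2)) + B_diag[i] * u[i]
            for i in range(2)]

assert spectral_radius_2x2(A) < 1  # stable market interaction
x = [0.0, 0.0]
for k in range(50):
    x = step(x, [1.0, 1.0])  # constant productivity decisions
```

Since ρ(A) < 1 here, the market sizes settle to the steady state (I − A)⁻¹Bu under constant decisions, matching the stability statement above.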

Remark 1.
The problem formulation in this paper can be naturally extended to non-linear systems without difficulty. However, such an extension requires extra modifications to the solution of the production planning problem, which is not discussed in this paper.

System Constraints
In practice, it should be realized that the system variables have specific physical limits that give rise to different forms of system constraints. For example, the system input represents the productivity of the participants at sample time index k, and all of its elements u_i(k), i = 1, . . ., n, must be non-negative and have an upper bound for maximum productivity. The input constraint set Ω usually represents the acceptable system input range. In this paper, the input saturation constraint u(k) ∈ Ω with

Ω = { u ∈ R^n : 0 ≤ u_i ≤ ū_i, i = 1, . . ., n } (4)

is considered and used to compute the solutions of the production planning problem in later sections, where ū_i denotes the input upper bound.

Production Planning Design Objectives
In a manufacturing task, the overall task performance is generally represented by a cost function f (u, x, k) consisting of the productivity decision u, the market size x and the sample time index k; for example, the cost function may represent the change in market size or the maximum productivity. Note that the cost function f (u, x, k) does not necessarily need to incorporate all of the variables u, x and k, and its exact definition changes with the target performance index to be optimized. To improve the task performance, the regulator should make appropriate decisions for the participants to optimize a certain performance index. This design objective brings significant practical benefits within the given task. Therefore, the production planning problem explored in this paper is formulated in the following definition. Definition 1. The Production Planning Problem refers to the choice of an appropriate productivity decision u(k) of the discrete time system (1) at each sample time index k, such that the desired cost function f (u, x, k) is optimized over the time period [1, N]; i.e.,

max_{u(k) ∈ Ω} ∑_{k=1}^{N} f (u, x, k), s.t. (1). (7)

To exemplify this design objective, this paper takes the overall net production profit of a collaborative manufacturing task to be the task performance, defined as

f (u, x, k) = g(x, k) − h(u, k − 1), (8)

where g(x, k) and h(u, k) are the revenue function and the effort function with respect to the sample time index k. This cost function makes sense in practice, as the production effort made at sample time index k − 1 brings the revenue at the next sample time index k. The revenue function g(x, k) is in the quadratic form

g(x, k) = s(k)^T x(k) − (1/2) x(k)^T Q x(k), (9)

where the diagonal matrix

Q = diag(q_1, q_2, . . ., q_n) (10)

contains the constant weighting parameters of the revenue for each participant, and the vector

s(k) = [s_1(k), s_2(k), . . ., s_n(k)]^T (11)

represents the production demands of the participants at sample time index k, which is a known time-varying value.
In the problem formulation, the production demand s(k) is the required order number of the products and does not have a direct relationship with the market size x(k), which denotes the actual number of products made by the factory. However, they together contribute to the revenue function g(x, k). According to the quadratic form (9), the revenue of each participant first increases to a peak value and then decreases along its market size axis; once the market size exceeds the demand, the overall revenue decreases. Note that the revenue is zero for zero market size, which is realistic in practice. The production demand is not a fixed value, as the demand for certain products may vary with the sample time index k. In addition, the production demand has a positive effect on the revenue, as an increase in production demand increases the product price. The effort function h(u, k) is in the quadratic form

h(u, k) = (1/2) u(k)^T R u(k), (12)

where the diagonal matrix

R = diag(r_1, r_2, . . ., r_n) (13)

contains the constant weighting parameters of the productivity effort for the participants. Based on the quadratic form (12), the effort of each participant increases quadratically as its productivity increases from zero. This quadratic form is realistic, as high productivity generally leads to extra costs in terms of human power, machine damage and energy consumption.
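The peak-and-decay behavior of the revenue and the quadratic growth of the effort can be checked numerically. The sketch below uses per-participant scalar forms consistent with the properties described above (the parameter values are illustrative assumptions):

```python
# Per-participant revenue and effort under the quadratic forms described above:
# g_i = s_i*x_i - (q_i/2)*x_i**2 peaks at x_i = s_i/q_i and is zero at x_i = 0;
# h_i = (r_i/2)*u_i**2 grows quadratically in the productivity u_i.
def revenue_i(s_i, x_i, q_i):
    return s_i * x_i - 0.5 * q_i * x_i ** 2

def effort_i(u_i, r_i):
    return 0.5 * r_i * u_i ** 2

s_i, q_i, r_i = 2.0, 1.0, 0.5
peak_x = s_i / q_i                       # market size that maximizes revenue
assert revenue_i(s_i, 0.0, q_i) == 0.0   # zero revenue at zero market size
assert revenue_i(s_i, peak_x, q_i) > revenue_i(s_i, peak_x + 1.0, q_i)
```

Raising the demand s_i moves the revenue peak outward, reflecting the positive effect of demand on revenue noted above.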
To solve the production planning problem (7), a reliable system model (1) and accurate values of R, Q and s(k) are required. However, unknown parameters occur frequently in practical production planning problems. In this specific case, the participants naturally attempt to optimize their private performance indices, and their individual decisions might even conflict with the optimal decisions made by the regulator for the overall task performance. Therefore, they might not share their private revenue weighting parameters q_i with the regulator or might provide inaccurate values. Thus, the matrix Q is generally unknown or inaccurate to the regulator, which restricts the reliability of the optimal decisions.
To handle this problem, the data-driven method of machine learning is applied to estimate these unknown parameters by extracting and analyzing the information from the recorded historical data, and MPC is then used to provide optimal decisions for the regulator based on estimated parameter values.

Machine Learning-Based MPC
In this section, a formula is derived to model the individual decisions of the participants under the free decision making condition. A gradient descent machine learning training procedure with low complexity is proposed to estimate the unknown parameters using the recorded historical data and the obtained formula. Then, MPC is considered to solve the problem (7) based on the estimated values of the parameters to yield the optimal decisions.

Individual Decision Modeling
A free decision making condition is now supposed for the participants, meaning that no optimal decisions are received from the regulator. Therefore, they have the authorization to make individual decisions for themselves. It is not surprising that the individual decisions are often biased from their optimal values due to the low response to market environment changes and the restricted insight of the participants. In this sense, the decision bias matrix

P = diag(p_1, p_2, . . ., p_n) (14)

is introduced to provide the decision bias parameters of the participants. These parameters satisfy p_i ∼ N(µ_i, σ_i), where µ_i is the mean value and σ_i is the standard deviation. The decision bias parameters represent the inherent difference between the actual individual decisions and the associated optimal options. By introducing these parameters, a representation formula of the individual decisions is derived and shown in the next theorem.

Theorem 1.
Under the free decision making condition, the ith participant attempts to maximize their individual task performance, defined by the cost function (15) at sample time index k, and the individual decision is

u_i(k) = F_i(Π_i, s, x), (16)

where Π_i = (q_i, p_i) involves the unknown parameters of the participant and the function F_i combines the unconstrained individual maximizer, the decision bias parameter p_i and the projection operator P_Ω_i(·). Proof. Under the free decision making condition, each participant makes their decision actively and focuses on maximizing their individual cost function (15), which gives rise to the problem (19). The problem (19) is a quadratic programming problem, and the partial differentiation of the cost function is performed with respect to u_i to find the unique stationary point. As the problem is concave in u_i, this stationary point is the global maximum and gives the unconstrained solution ũ_i(k). Using the decision bias parameter p_i, the actual decision of each participant deviates from ũ_i(k) accordingly. To satisfy the input constraint, the projection operator P_Ω_i(·) is applied, which provides the individual decision (16).
The above theorem derives a representation formula of the individual decisions under the free decision making condition. It not only attempts to optimize the individual task performance of each participant but also incorporates the decision bias into the individual decision making processes. Moreover, the representation formula of the individual decision vector is stated in the next corollary.
Corollary 1. Under the free decision making condition, the individual decision vector is

u(k) = F(Π, s, x), (22)

where Π = (Q, P) denotes the benchmark of the unknown parameters and the function F stacks the individual decisions (16) of all participants, with the projection operator P_Ω(ũ) = arg min_{u∈Ω} ‖u − ũ‖.
Proof. The vector form (22) is naturally derived from (16) with standard matrix calculations and derivations.
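As a hedged illustration of the individual decision making, the following Python sketch computes a scalar participant decision: it takes the stationary point of the quadratic individual profit (the scalar analogue of the closed-form solution derived later for the MPC problem), perturbs it with an additive bias p_i — the exact way the bias enters the decision is an assumption here — and projects the result into [0, ū_i].

```python
def f_individual(q_i, p_i, b_i, r_i, s_next_i, ax_i, u_bar=1.5):
    # Unconstrained stationary point of the quadratic individual profit
    # (scalar analogue of the closed-form one-step MPC solution), where
    # ax_i is the i-th component of A x(k) and b_i the diagonal entry of B.
    u_tilde = b_i * (s_next_i - q_i * ax_i) / (b_i ** 2 * q_i + r_i)
    u_biased = u_tilde + p_i  # additive bias: an illustrative assumption
    # Projection onto the saturation interval [0, u_bar]
    return min(max(u_biased, 0.0), u_bar)

# Unbiased participant: the decision equals the stationary point
u = f_individual(q_i=1.0, p_i=0.0, b_i=1.0, r_i=1.0, s_next_i=2.0, ax_i=0.0)
```

A large positive or negative bias is absorbed by the projection, so the recorded decisions always stay in the feasible range, as stated in (16).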

Gradient Descent Machine Learning
In industry, it is normal practice to record historical production information. Suppose that the historical data are recorded over m sample times under the free decision making condition, and the data set Λ with m elements is generated and stored. Each element λ = (u, x, s) ∈ Λ contains the decision, the state and the production demand, which are linked by the function F as u = F(Π, s, x) according to Corollary 1. However, Π is unknown in the general sense, and the estimate Π̂ = (Q̂, P̂) is employed in this paper to denote all unknown parameters within the function F(·). In this sense, an accurate estimate Π̂ provides a decision û = F(Π̂, s, x) which approaches the real value u. This fact motivates the data-driven technique of machine learning to estimate the benchmark Π = (Q, P). The machine learning problem and its design objective are stated in the following definition.

Definition 2.
The machine learning problem aims at iteratively updating the estimate Π̂ such that the decision loss function L(Π̂, λ), which measures the deviation of the estimated decision û = F(Π̂, s, x) from the recorded decision u, is minimized for all λ ∈ Λ.
In the above problem definition, the decision loss denotes the difference between the decision û computed from the estimate Π̂ and the recorded historical decision u. If the decision loss is sufficiently small, the estimated decision û approximates u and the estimate Π̂ is considered as an approximation of the benchmark Π. Once an accurate estimate Π̂ is available, more reliable model information becomes possible for the later MPC procedure as it considers the values of Q̂ while solving the optimization problem.
Furthermore, the decision loss function L_i(Π̂, λ) of each participant is defined as the deviation of the ith estimated decision û_i from the recorded decision u_i, the sum of which gives rise to the overall decision loss function; i.e.,

L(Π̂, λ) = L_1(Π̂, λ) + L_2(Π̂, λ) + . . . + L_n(Π̂, λ). (26)

The widely used method of machine learning is an iterative training procedure with several training epochs and an initial estimate Π̂_0. This method randomly splits a total number of η elements from the data set Λ into a training set Λ_train, and the rest are sorted into a testing set Λ_test [46]. At each training epoch, it updates the estimate Π̂ using each element in the training set Λ_train to reduce the decision loss (26). At the end of each training epoch, the decision loss (26) of the elements in the testing set Λ_test, denoted ι_j, is computed to evaluate the training performance of the machine learning method, where the subscript j ∈ N denotes the training epoch number.
Since the testing set Λ_test is independent of the training set Λ_train, it is possible to use the evaluation results to double-check the training performance of the machine learning design using data external to the training set, which avoids the limitation of over-fitting. A gradient descent machine learning training procedure is proposed to reduce the decision loss (26) with respect to the elements in the training set Λ_train during each training epoch. The training procedure of each training element is illustrated in the following theorem. Theorem 2. For λ ∈ Λ_train, the estimate Π̂ is updated to Π̂′ = (Q̂′, P̂′) using the gradient descent update

q̂_i′ = q̂_i − γ_i ∇_{q̂_i} L(Π̂, λ), p̂_i′ = p̂_i − γ_i ∇_{p̂_i} L(Π̂, λ), i = 1, . . ., n, (31)

such that the overall decision loss is reduced; i.e., L(Π̂′, λ) ≤ L(Π̂, λ), where ∇_{q̂_i} L(·) and ∇_{p̂_i} L(·) are partial derivatives of (26) with respect to q̂_i and p̂_i, and γ_i is the adaptive learning rate defined by

γ_i = γ α^{β_i}, (33)

where 0 < α < 1 and γ > 0 are constant scalars, and β_i is the smallest non-negative integer such that the individual decision loss is reduced, as required in (34). Proof. Since the individual decision loss reduction of all participants in (34) is guaranteed by the adaptive learning rate choice (33), the overall decision loss naturally reduces according to (28).
This gradient descent machine learning training procedure aims at updating the estimateΠ to reduce the decision loss of all elements in the training set and obtains a reliable estimate of the unknown parameter values after sufficient training epochs.
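The adaptive learning rate γ_i = γα^{β_i} of Theorem 2 amounts to a backtracking step-size search: shrink the step until the loss actually decreases. A minimal one-parameter Python sketch (the toy loss is an assumption, not the paper's decision loss):

```python
def backtracking_gd_step(loss, grad, theta, gamma=1.0, alpha=0.5, max_beta=30):
    # Adaptive rate gamma * alpha**beta: increase beta until the candidate
    # step reduces the loss, mirroring the rate choice in Theorem 2.
    g = grad(theta)
    beta = 0
    while beta <= max_beta:
        candidate = theta - (gamma * alpha ** beta) * g
        if loss(candidate) < loss(theta):
            return candidate
        beta += 1
    return theta  # no decreasing step found; keep the current estimate

# Toy regression loss L(q) = (q - 3)^2 with gradient 2(q - 3)
loss = lambda q: (q - 3.0) ** 2
grad = lambda q: 2.0 * (q - 3.0)
q = 0.0
for _ in range(100):
    q = backtracking_gd_step(loss, grad, q)
```

Because every accepted step strictly reduces the loss, the iterate can never diverge, which is the monotone-decrease guarantee the theorem relies on.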
Remark 2. K-fold cross-validation is recommended if the amount of historical data is relatively small. Different splits of the data enable the cross validation using all elements within the historical data set [47]. The adaptive learning rate choice (33) is a mechanism used to ensure both the maximum gradient descent step size and the reduction of the decision loss. Other alternative learning rate choices can be made using bisection or back-stepping methods.
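A plain-Python sketch of the K-fold split recommended in Remark 2 (no external libraries; the interleaved fold assignment is an arbitrary choice):

```python
def k_fold_splits(data, k=5):
    # Yield (train, test) pairs so that every element is tested exactly once,
    # as recommended when the historical data set is small.
    folds = [data[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, test

data = list(range(10))
splits = list(k_fold_splits(data, k=5))
```

Each of the k splits can seed one training run, and averaging the k testing losses gives a less optimistic estimate of the training performance than a single split.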

Remark 3.
For each element λ ∈ Λ, there exist infinite choices of Π̂ that reduce its decision loss L(Π̂, λ) to zero, but there only exists a rather small range of Π̂ in the search space that obtains a sufficiently small overall decision loss. The trial and error method is used in (31) to reduce the decision loss of all λ ∈ Λ_train in a random order, with the aim of minimizing the overall decision loss. However, its convergence along the training epochs cannot always be guaranteed due to the independence of the elements in Λ; i.e., an update of Π̂ may reduce L(Π̂, λ_i) but increase L(Π̂, λ_{i+1}).

Remark 4.
Note that other non-iterative methods, such as echo-state networks, extreme learning machines or kernel ridge regression, are also capable of solving the regression design objective shown in Definition 2.

Remark 5.
In more general production planning problems, the function F(·) may become complex or fuzzy due to different system dynamics. To handle the machine learning problem in such cases, a deep neural network model can be utilized to estimate the unknown values.

MPC Production Planning Problem
The machine learning procedure in Section 3.2 is capable of determining the unknown parameter values for the production planning problem. Therefore, an analytic representation of the problem formulation becomes available, and the regulator can make the optimal decisions for the participants to improve the overall task performance. To solve the production planning problem, the technique of MPC is applied at each sample time index k to update the optimal decisions over a future time interval [k, k + ∆k] by solving the optimization problem

min_{u(k), . . ., u(k+∆k−1)} ∑_{i=k}^{k+∆k−1} [ (1/2) x(i + 1|k)^T Q x(i + 1|k) − s(i + 1)^T x(i + 1|k) + (1/2) u(i)^T R u(i) ],
s.t. x(i + 1|k) = Ax(i|k) + Bu(i), x(k|k) = x(k), u(i) ∈ Ω, (35)

where x(i + 1|k) is the predicted state at sample time index i + 1 based on the real information x(k) at k and the planned input u(i) over the period [k, i]. At each sample time, MPC uses the current input u(k), the current state x(k) and the system dynamics to predict the future states x(i + 1|k), i = k, . . ., k + ∆k − 1, over the prediction time period [k + 1, k + ∆k]. An appropriate solution is chosen to optimize the cost function consisting of these predicted values, and the previous decision (if it exists) is replaced with this new decision along the future time period. The optimal decisions aim at maximizing the overall net profit of the collaborative manufacturing task, while the individual decisions focus on the net profit of an individual participant. A comparison between these decisions is made in Section 5 via a case study, and the benefits of using the optimal decisions are illustrated in detail.
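The receding-horizon mechanism can be summarized in a few lines: solve over the horizon, apply only the first decision, re-measure the state, repeat. The sketch below uses a toy scalar system and a hypothetical one-step solver, not the problem (35) itself:

```python
def mpc_loop(x0, N, solve_horizon, step):
    # Receding-horizon skeleton: at each sample time, solve over [k, k+dk],
    # apply only the first planned decision, then re-solve from the new state.
    x, decisions = x0, []
    for k in range(N):
        u_plan = solve_horizon(x, k)  # planned input sequence over the horizon
        u = u_plan[0]                 # apply only the first move
        decisions.append(u)
        x = step(x, u)
    return decisions

# Toy scalar instance: x(k+1) = 0.5 x(k) + u(k), tracking a demand of 1
solve = lambda x, k: [1.0 - 0.5 * x]  # one-step horizon, exact tracking
step = lambda x, u: 0.5 * x + u
traj = mpc_loop(0.0, 5, solve, step)
```

Re-solving at every sample time is what lets the regulator react to the time-varying demand s(k) instead of committing to a fixed year-long plan.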

Implementation Instructions
The instructions are provided in this section for implementing the solutions in Section 3, and a comprehensive algorithm is then proposed to provide a solution to the production planning problem with a collaborative mode and unknown information.

Instructions on Projection Solution
According to the definition (22) of the projection operator P Ω (·), the unconstrained decision is projected into the given decision constraint set Ω, which generates the constrained decision. The decision saturation constraint (4) is element-wise and the constraint handling procedure is straightforward. The next proposition provides the analytic solution of the projection operator P Ω (·) with respect to the decision saturation constraint (4) as an example.

Proposition 1.
If the constraint set Ω is in the form of the input saturation constraint (4), the analytic solution of the projection operator P_Ω(·) defined in (22) is given element-wise by

[P_Ω(ũ)]_i = min( max( ũ_i, 0 ), ū_i ), i = 1, . . ., n. (36)

Proof. As the input saturation constraint (4) is element-wise, the projection operator P_Ω is computed by solving the problem

min_{u_i} |u_i − ũ_i|, s.t. 0 ≤ u_i ≤ ū_i, (37)

for i = 1, . . ., n, which yields the analytic solution (36).
The above proposition provides the analytic solution of P Ω (·) based on the input saturation constraint (4). The solutions of the projection operator P Ω (·) with other types of input constraints can be handled in a similar way.
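For the saturation constraint, the projection therefore reduces to an element-wise clamp, as in this sketch:

```python
def project_saturation(u_tilde, u_bar):
    # Element-wise projection onto Omega = { u : 0 <= u_i <= u_bar_i },
    # the analytic clamp solution of the projection problem above.
    return [min(max(u, 0.0), ub) for u, ub in zip(u_tilde, u_bar)]

assert project_saturation([-0.2, 0.7, 2.3], [1.5, 1.5, 1.5]) == [0.0, 0.7, 1.5]
```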

Instructions on Initial Estimate Choice
The machine learning training procedure is carried out in an iterative manner, and the decision loss function in (26) is generally non-linear and non-convex. Although the gradient descent training procedure (31) guarantees the reduction of the decision loss for the various possible initial estimate choicesΠ 0 , the initial estimate choice definitely affects the training performance. The estimate might converge to a local minimum point with a relatively large decision loss if the initial estimate is not chosen properly.
To achieve the desired training performance, the estimate search space is predefined as

Θ = { (Q̂, P̂) : q̲_i ≤ q̂_i ≤ q̄_i, p̲_i ≤ p̂_i ≤ p̄_i, i = 1, . . ., n }, (38)

where the scalars q̲_i, p̲_i, q̄_i and p̄_i, i = 1, . . ., n, are appropriately chosen as the lower and upper bounds of the search space. In this sense, two different initial estimate choices are introduced, and the first one is defined by Definition 3. The central initial estimate choice is used to specify all initial values as the central values of their intervals; i.e.,

q̂_i = (q̲_i + q̄_i)/2, p̂_i = (p̲_i + p̄_i)/2, i = 1, . . ., n. (39)

This choice ensures the initial distance to the benchmark Π does not appear to be too large and provides a relatively fair initial choice. Alternatively, the second choice is defined by Definition 4. The grid initial estimate choice is used to specify the initial estimate choice as the best fitting solution to the grid search problem

Π̂_0 = arg min_{Π̂ ∈ Θ̃} ∑_{λ∈Λ} L(Π̂, λ), (40)

where Θ̃ is a discretized finite subset of Θ in which each parameter interval is sampled with the sample numbers n_q and n_p.
The word "grid" in the name of this choice implies that the sample numbers n_q and n_p should not be too large, so that the total number of elements in Θ̃ is not excessive [48]. Therefore, the grid search problem (40) can be computed efficiently, capturing an approximate region of the global optimal solution to the minimum decision loss problem

min_{Π̂ ∈ Θ} ∑_{λ∈Λ} L(Π̂, λ). (42)

However, this choice requires extra computational time to carry out the grid search as the number of unknown parameters increases. Remark 6. These two initial choices both provide reliable initial values of the estimate for the gradient descent training procedure (31). However, a trade-off between computational efficiency and training performance improvement should be taken into account in practical implementation.
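A minimal sketch of the grid initial estimate choice for a single participant (one q interval and one p interval; the toy loss and the bounds are assumptions):

```python
import itertools

def grid_initial_estimate(loss, q_bounds, p_bounds, n_q=5, n_p=5):
    # Coarse grid over the search space Theta; the best grid point seeds the
    # gradient descent procedure (sample numbers kept small on purpose).
    q_lo, q_hi = q_bounds
    p_lo, p_hi = p_bounds
    q_grid = [q_lo + i * (q_hi - q_lo) / (n_q - 1) for i in range(n_q)]
    p_grid = [p_lo + i * (p_hi - p_lo) / (n_p - 1) for i in range(n_p)]
    return min(itertools.product(q_grid, p_grid), key=loss)

# Toy decision loss with its minimum near (q, p) = (2, 0.5)
loss = lambda qp: (qp[0] - 2.0) ** 2 + (qp[1] - 0.5) ** 2
q0, p0 = grid_initial_estimate(loss, (0.0, 4.0), (0.0, 1.0))
```

With 2n unknown parameters the grid has n_q^n · n_p^n points, which is exactly the combinatorial growth the remark warns about.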

Instructions for Partial Derivative Estimation
Since the decision loss function L(·) has an analytic form (26), the value of its partial derivatives with respect to q̂_i and p̂_i can be computed using the syntax gradient in MATLAB or autograd in Python. Alternatively, a numerical estimation is used to obtain the partial derivative,

∇_{q̂_i} L(Π̂, λ) ≈ [ L((Q̂_i^+, P̂), λ) − L((Q̂_i^−, P̂), λ) ] / (2δ),

where

Q̂_i^+ = diag(q̂_1, q̂_2, . . ., q̂_i + δ, . . ., q̂_n), Q̂_i^− = diag(q̂_1, q̂_2, . . ., q̂_i − δ, . . ., q̂_n),

and the scalar δ > 0 is sufficiently small. The partial derivative ∇_{p̂_i} L(·) can be obtained in a similar way. This numerical estimation incurs less computational load and is still capable of giving approximate values of the derivatives even if the analytic representation of the decision loss function L(·) is not available in some extreme cases.
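The central-difference estimate can be written generically; the toy loss below has a known gradient so the approximation can be checked:

```python
def central_difference(loss, q, i, delta=1e-5):
    # Numerical estimate of the partial derivative of the loss with respect
    # to the i-th parameter, perturbing it by a small delta > 0 in each
    # direction and dividing the loss difference by 2*delta.
    q_plus = q[:i] + [q[i] + delta] + q[i + 1:]
    q_minus = q[:i] + [q[i] - delta] + q[i + 1:]
    return (loss(q_plus) - loss(q_minus)) / (2.0 * delta)

loss = lambda q: q[0] ** 2 + 3.0 * q[1]       # toy loss with known gradient
g0 = central_difference(loss, [2.0, 1.0], 0)  # analytic value: 4
g1 = central_difference(loss, [2.0, 1.0], 1)  # analytic value: 3
```

Each partial derivative costs two loss evaluations, so a full gradient over 2n parameters needs 4n evaluations per training element.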

Instructions for MPC Problem Solution
The MPC problem (35) has a quadratic cost function, linear equality constraint and convex inequality constraint. Therefore, it is a quadratic programming problem, which is a convex optimization problem with a global optimal solution. There exist various standard solvers for quadratic programming problems, such as the solver quadprog in MATLAB. These standard solvers can directly provide the global optimal solution.
Furthermore, an alternative computationally efficient solution is obtained by decoupling the input constraint from the optimization problem (35) and setting the prediction interval to a unit sample time; i.e., ∆k = 1. The solution to the problem (35) is then described as first solving the unconstrained problem

min_u (1/2) x(k + 1|k)^T Q x(k + 1|k) − s(k + 1)^T x(k + 1|k) + (1/2) u^T R u, s.t. x(k + 1|k) = Ax(k) + Bu, (44)

and then projecting the solution into the input constraint set Ω using P_Ω(·), which is illustrated as follows. Proposition 2. If the input constraint is decoupled from the problem (35) and ∆k = 1, its alternative solution is

u(k) = P_Ω(ũ(k)), (45)

where ũ(k) is the solution of (44); i.e.,

ũ(k) = (BQB + R)^(−1)(Bs(k + 1) − BQAx(k)). (46)
Proof. As the input constraint is decoupled and ∆k = 1, the problem (35) collapses to the problem (44). The problem (44) becomes an unconstrained problem by substituting the equality constraint x(k + 1|k) = Ax(k) + Bu into the cost function, which yields the solution (46). To handle the input constraint, the projection operator P_Ω(·) is considered to give the solution (45).
The alternative solution to problem (35) in Proposition 2 is computationally more efficient than the direct solution provided by the standard solvers. A trade-off between efficiency and accuracy should be considered by the production regulator according to the properties of the exact production planning scenario.
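Because B, Q and R are all diagonal, the closed-form solution (46) decouples per participant; a scalar Python sketch with the projection step (parameter values are illustrative):

```python
def one_step_decision(ax_i, s_next_i, b_i, q_i, r_i, u_bar_i):
    # Scalar form of u = (BQB + R)^(-1) (B s(k+1) - B Q A x(k)) for diagonal
    # B, Q, R, followed by projection onto the saturation constraint, where
    # ax_i denotes the i-th component of A x(k).
    u_tilde = b_i * (s_next_i - q_i * ax_i) / (b_i ** 2 * q_i + r_i)
    return min(max(u_tilde, 0.0), u_bar_i)

# One department: demand 4, zero propagated market size, bound 1.5
u = one_step_decision(ax_i=0.0, s_next_i=4.0, b_i=1.0, q_i=2.0, r_i=2.0,
                      u_bar_i=1.5)
```

Note that the clamp is only equivalent to the fully constrained optimum because the per-participant subproblems are independent; with a non-diagonal B, the decoupled shortcut would sacrifice optimality, which is the accuracy side of the trade-off mentioned above.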

Production Planning Comprehensive Algorithm
The techniques of machine learning and MPC are combined into a comprehensive algorithm (Algorithm 1) that solves the production planning problem with a collaborative mode and unknown information. The algorithm first uses the gradient descent machine learning procedure to estimate the unknown parameters and then applies MPC to obtain the optimal decisions for the participants at each sample time index k. Note that the scalar N denotes the total number of sample times and ε > 0 is a small scalar depending on the accuracy requirement of the machine learning estimation. For a straightforward view, a block diagram of the machine learning-based MPC algorithm is shown in Figure 1.

Algorithm 1 Machine learning-based MPC production planning
Require: System dynamics (1), cost function f(u, x, k), production demand s(k), initial state x(0), historical data set Λ.
1: initialization: Set the training epoch number j = 1 and the sample time index k = 0, and randomly split η elements of the set Λ into a training set Λ_train, with the rest forming a testing set Λ_test.
2: while the decision loss exceeds ε do
3: Perform the training procedure (31) to update the estimate Π̂ using all elements in Λ_train.
4: Perform the evaluation (30) to compute the decision loss ι_i^j using all elements in Λ_test.
5: Set j = j + 1 to move to the next training epoch.
6: end while
7: Use the estimate Q̂ in place of the unknown matrix Q.
8: while k < N do
9: Solve the MPC problem (35) to obtain u(k).
10: Record the optimal decision u(k).
11: Set k = k + 1.
12: end while
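The two-phase structure of Algorithm 1 can be sketched in Python as follows. The callables `train_step`, `eval_loss`, `mpc_solve` and `dynamics` are hypothetical stand-ins for the paper's training procedure (31), decision-loss evaluation (30), MPC problem (35) and system dynamics (1); their exact forms are problem-specific.

```python
def ml_based_mpc(train_step, eval_loss, mpc_solve, dynamics,
                 Lambda_train, Lambda_test, Pi0, x0, N,
                 eps=1e-4, max_epochs=500):
    """Sketch of Algorithm 1: estimate unknown parameters, then plan with MPC.

    Phase 1 trains the parameter estimate until the decision loss on the
    testing set drops below eps; phase 2 solves the MPC problem at each
    sample time index and records the optimal decisions.
    """
    # Phase 1: gradient-descent estimation (steps 1-7 of Algorithm 1).
    Pi, j = Pi0, 1
    while j <= max_epochs:
        Pi = train_step(Pi, Lambda_train)      # training update, cf. (31)
        if eval_loss(Pi, Lambda_test) <= eps:  # decision loss, cf. (30)
            break
        j += 1
    # Phase 2: MPC decision making (steps 8-12 of Algorithm 1).
    x, decisions = x0, []
    for k in range(N):
        u = mpc_solve(Pi, x, k)                # solve problem (35)
        decisions.append(u)
        x = dynamics(x, u)                     # advance system (1)
    return Pi, decisions
```

The `max_epochs` guard is an implementation safeguard, not part of the paper's algorithm: it stops training if the loss never reaches ε.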

Simulation-Based Case Study
In this section, a case study of a production planning problem is presented, and the performance of Algorithm 1 is evaluated in comparison with individual decision making and pure MPC decision making.

Problem Design Specifications
In this paper, we consider the production planning scenario of a clothing factory as a case study. The factory has a total of 10 departments (n = 10; i.e., 10 participants) working in a collaborative mode. The modified dynamics of the discrete-time system (1) of this factory are considered with the parameter matrices specified for this case study. These system dynamics demonstrate how the productivity decision of an individual participant affects the market sizes of the other participants. The sample time period is one day, which means the regulator plans the productivity decisions one day in advance. The total number of sample times N is 365; i.e., one year. The initial state value is chosen as x(0) = 0, and the input saturation constraint (4) is imposed; the randomly generated parameters p_i have a standard deviation of σ_i = 0.01. Since the parameters in the matrix Q are unknown to the regulator, these values are only applied as a benchmark to generate the historical data and to evaluate the estimation accuracy of the machine learning procedure. The same argument holds for the unknown matrix P.
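A minimal simulation of such collaborative market-size dynamics can be sketched as follows. The matrices A and B here are hypothetical stand-ins (the paper's exact factory matrices are not reproduced): a stable diagonal A with weak off-diagonal coupling, so that one department's decision perturbs the market sizes of the others, and B = I.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10                                   # number of departments

# Hypothetical stand-ins for the unspecified factory matrices.
A = 0.8 * np.eye(n) + 0.01 * (np.ones((n, n)) - np.eye(n))
B = np.eye(n)

def step(x, u):
    """One day of the discrete-time market-size dynamics x(k+1) = A x(k) + B u(k)."""
    return A @ x + B @ u

x = np.zeros(n)                          # x(0) = 0 as in the case study
for k in range(365):                     # N = 365 sample times (one year)
    # Placeholder decisions, clipped to the saturation range [0, 1.5].
    u = np.clip(rng.normal(0.75, 0.25, n), 0.0, 1.5)
    x = step(x, u)
```

Since the chosen A has spectral radius below one, the simulated market sizes stay bounded over the year, which mirrors the stable collaborative setting of the case study.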
In addition, the production demand of each department over the total decision-making period [0, 365] is described by the sinusoidal function

s_i(k) = κ_i sin(2πk/365 + ϕ_i) + τ_i, (47)

where the amplitude κ_i, the center amplitude τ_i and the phase ϕ_i are the elements of given parameter vectors. This case study focuses on clothing manufacturing, so the production demand varies across the seasons of a calendar year. In this sense, the sinusoidal function (47) reasonably represents the production demands of the departments. These seasonally varying demands contribute to the historical demand within the set Λ and are used to compute the optimal decisions in Sections 5.2 and 5.3.
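Seasonal demands of this sinusoidal form can be generated as sketched below. The parameter vectors κ, τ and ϕ here are hypothetical (the paper's exact values were not recovered), and an annual period of 365 days is assumed.

```python
import numpy as np

def seasonal_demand(k, kappa, tau, phi, period=365):
    """Sinusoidal production demand of the form
    s_i(k) = kappa_i * sin(2*pi*k/period + phi_i) + tau_i."""
    return kappa * np.sin(2.0 * np.pi * k / period + phi) + tau

# Hypothetical parameter vectors for n = 10 departments.
n = 10
kappa = np.full(n, 0.3)                  # amplitudes
tau = np.full(n, 1.0)                    # center amplitudes
phi = np.linspace(0.0, np.pi, n)         # staggered phases across departments

s_winter = seasonal_demand(0, kappa, tau, phi)    # start of the year
s_summer = seasonal_demand(180, kappa, tau, phi)  # mid-year
```

By construction, each department's demand oscillates within [τ_i − κ_i, τ_i + κ_i], so the amplitudes directly bound the seasonal swing.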
These problem design specifications closely replicate the industrial environment of clothing manufacturing. Therefore, the simulation results indicate the practical effectiveness and feasibility of the proposed algorithm.

Parameter Estimation Using Machine Learning
To obtain the historical data, the following data sampling procedure was employed. First, the benchmark matrices Q and P defined in Section 5.1 were used to generate the individual decisions via Formula (22) over a production planning period [1, 365], denoting a calendar year. The historical data set Λ was thereby established with a total of 365 groups of historical data; i.e., λ_i = (s_i, u_i, x_i), i = 1, . . ., 365 (m = 365). In this case study, the groups of data generated in this way replicated the interaction between the production demands, individual decisions and market sizes of each department under the free decision-making condition and were thus regarded as effective historical data for the subsequent machine learning procedure.
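The sampling-and-splitting step can be sketched as follows; the data entries here are random placeholders for the recorded demand, decision and market-size vectors (s_i, u_i, x_i).

```python
import numpy as np

rng = np.random.default_rng(1)
m, eta = 365, 100                        # total data groups, training-set size

# Each historical group lambda_i = (s_i, u_i, x_i); placeholder 10-vectors
# stand in for the recorded demand, decision and market-size data.
Lambda = [(rng.random(10), rng.random(10), rng.random(10)) for _ in range(m)]

# Randomly assign eta groups to the training set and the rest to the testing set.
idx = rng.permutation(m)
Lambda_train = [Lambda[i] for i in idx[:eta]]
Lambda_test = [Lambda[i] for i in idx[eta:]]
```

Using a random permutation of indices guarantees the two sets are disjoint and together cover all m groups.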
The machine learning procedure of Algorithm 1 randomly split a total of 100 (η = 100) groups of the historical data into a training set Λ_train, with the remaining groups forming a testing set Λ_test. It then estimated the unknown parameter values of Q and P from the training set Λ_train. The grid initial estimate choice in Definition 4 was carried out in the search space (38) for i = 1, . . ., 10. The values α = 0.6 and γ = 0.05 were used to determine the adaptive learning rate choice (33). The decision loss during the training procedure was plotted against the training epoch number for each department in Figure 2, and all curves asymptotically converged to sufficiently small values. Meanwhile, the decision loss of each department was computed on the testing data at the end of each training epoch and plotted as the colored lines in Figure 3; all lines in this figure also asymptotically converged to sufficiently small values. This means that the estimated decision û = F(Π̂, s, x) gradually approached the real decision u, so the training procedure succeeded in providing an appropriate estimate Π̂. This evaluation demonstrates the feasibility and reliability of the machine learning procedure in terms of validation on testing data. The choice of the initial estimate affected the convergence performance of the 10 departments. For instance, although the estimated values of the fourth department were already sufficiently close to the benchmark values, its convergence rate along the training epochs was much slower than that of the others. Furthermore, the updated estimate Π̂ was compared with the benchmark values at each training update, and the results of the first department are plotted in Figure 4. For comparison, the corresponding benchmark values are plotted in the same figure as dashed magenta lines (the mean value μ_1 of p_i is used for comparison).
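A minimal gradient-descent trainer with a decaying learning rate can be sketched as follows. The exact adaptive rate rule (33) is not reproduced here, so a decaying step γ·α^j at epoch j, with the stated α = 0.6 and γ = 0.05, is assumed; `grad` is a hypothetical stand-in for the gradient of the decision loss with respect to the parameter estimate.

```python
def train(Pi0, data, grad, alpha=0.6, gamma=0.05, epochs=50):
    """Gradient-descent training sketch with an assumed decaying learning rate.

    At epoch j the estimate is updated as
        Pi <- Pi - gamma * alpha**j * grad(Pi, data),
    so early epochs take large corrective steps and later epochs refine
    the estimate with progressively smaller ones.
    """
    Pi = Pi0
    for j in range(epochs):
        Pi = Pi - gamma * alpha**j * grad(Pi, data)
    return Pi
```

A decaying step of this kind trades convergence speed for stability: it cannot overshoot indefinitely, but a poor initial estimate converges more slowly, consistent with the sensitivity to the initial estimate observed across the departments.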
It is observed that the estimated values asymptotically converged to their benchmark values, which means that the converged estimate matched its benchmark well. In practice, the historical data of the production information should be recorded over a past time period rather than artificially generated from the benchmark, as in this case study. The performance of the machine learning procedure should then be evaluated based on the convergence of the decision loss with respect to the testing data.

Decision Making Using MPC
Since the machine learning procedure in Section 5.2 provided an accurate estimate Π̂ of the unknown parameter values, the regulator had sufficient information about the production problem to make the optimal decisions. The MPC decision making procedure of Algorithm 1 was performed for the production planning problem specified in Section 5.1 with a unit prediction interval (∆k = 1), using the given system dynamics (1) and the desired cost function (8).
The performance of the MPC decision making procedure was evaluated over the whole production planning period [1, 365], and the overall net profits of the factory at each sample time index are plotted as the magenta curve in Figure 5. Note that the profit in winter was approximately 30% higher than that in summer. For comparison, the net profit resulting from the individual decisions was computed for the same production planning problem, and the pure MPC approach was applied with an inaccurate estimate obtained by adding a ±50% random deviation to the benchmark values; these results are plotted in the same figure as the blue and red curves, respectively. The results suggest that the machine learning (ML)-based MPC decisions increase the profit by over 10% compared to the individual biased decisions. In comparison to the pure MPC decisions, the proposed approach employs machine learning to obtain an accurate estimate of the unknown parameters, increasing the annual profit by around 5%.
Furthermore, the optimal decisions of each department are plotted as the colored curves in Figure 6. Note that the optimal decisions were all within the saturation range [0, 1.5], which means that Algorithm 1 could handle the input constraint appropriately. Furthermore, the optimal decisions had a causal relationship with the production demand (47). In other words, a higher production demand encourages a larger decision value (production effort). From the simulation results, Algorithm 1 can be seen to provide the optimal decisions within the constraint at each sample time index to increase the net profit. Therefore, Algorithm 1 achieved the desired practical benefits in solving the production planning problem. Note that the authors employed the alternative solution in Proposition 2 to solve problem (35), providing similar results to those obtained from quadprog.

Conclusions and Future Work
This paper focuses on a production planning problem with a collaborative mode and unknown information and proposes a machine learning-based MPC algorithm to solve it. The problem is formulated mathematically using a discrete-time system and a net profit cost function. The historical data under the free decision making condition, with a certain degree of decision bias, are recorded to establish a dataset. A gradient descent machine learning procedure is proposed to estimate the unknown parameter values from the elements of this dataset, and MPC uses the estimated parameters to make the optimal decisions by solving a quadratic optimization problem at each sample time index. Together, these procedures yield a comprehensive algorithm for this specific class of production planning problems. The algorithm is validated using a simulation-based case study that checks its parameter estimation accuracy and task performance. In addition, a comparison with individual decision making and pure MPC decision making shows its advantages in terms of increasing profit.
Although the efficiency and effectiveness of the proposed algorithm are shown in this paper, some potential extensions can be made to strengthen the generality and reliability of this method. First, a general non-linear system model could broaden the applicability of the unknown-information estimation to other control methods, such as iterative learning control [49,50], and task performance targets other than the net profit could be considered to achieve alternative benefits. Moreover, systems with complex dynamic interactions among the participants and the associated applications, e.g., a large-scale non-diagonal Q matrix, are under consideration for future work. Last but not least, other frameworks, such as real-time optimization and economic MPC, will be explored to confirm the wide applicability of the proposed algorithm in the area of production planning.