An Actor-Critic Algorithm for the Stochastic Cutting Stock Problem
Abstract
1. Introduction
2. Materials and Methods
2.1. Problem Statement
2.2. Reinforcement Learning
2.3. Advantage Actor-Critic
2.3.1. Discrete Action Space
2.3.2. Continuous Action Space
2.4. Environment
- The state st,i represents the inventory of item i (i ∈ {1, 2, …, 7}) at time step t. The initial state (i.e., the initial inventory) follows a discrete uniform probability distribution between 0 and 35, which can be formulated as s0,i ∼ U{0, 35}.
- An action at,j represents the number of times to apply pattern j.
- To estimate the reward, the cost is first calculated using Equation (15), where the trim loss of each pattern j is taken from Table 2 and [x]+ = max(0, x). The values and definitions of the remaining cost parameters are provided in Table 4. The cost consists of the trim loss, the inventory cost, and the back-order cost. The reward is a function of the cost, computed by dividing 500 by the cost, as shown in Equation (16): when the cost is greater than 500, the reward is less than one, and when the cost equals 500, the reward is exactly one. The reward function was designed in this manner for the following reasons:
- Based on the results presented in [7], a cost of approximately 500 is sufficiently low for this SCSP example.
- The training of the critic and actor networks may be unstable if the variance of the loss is large. Therefore, the reward should be approximately equal to one.
- There are two constraints on actions in the target environment system, as described below. First, the inventory of any item at any time cannot exceed the maximum inventory. Second, the total number of patterns to be used must be less than the number of available stock materials.
- The value of the variable donet depends on whether action at violates the constraints (Equation (18)). The variable donet takes a value of one if the current episode ends and a value of zero if the episode continues; the state is then updated by the state transition function (Equation (13)). We adopted the game-like convention that an episode ends when an action violates any constraint and continues when all constraints are satisfied, to improve the ability of the model to deal with the constraints. For a discount factor close to one, the greater the number of future steps, the higher the accumulated reward, so the training objective encourages actions that continue to satisfy the constraints. A minimal sketch of this environment logic is given after this list.
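The following Python sketch illustrates the environment logic described above. It is a minimal, illustrative reconstruction, not the authors' implementation: the class name `CuttingStockEnv` and its interface are assumptions, the demand vector is passed in externally because the demand model is specified elsewhere, the back-order cost coefficient is a placeholder (its value is not given in the extracted Table 4), the reward form 500/cost follows the behavioral description in the text, and the reward returned on a constraint violation is assumed to be zero.

```python
import numpy as np

class CuttingStockEnv:
    """Minimal sketch of the SCSP environment described above (illustrative only)."""

    def __init__(self, patterns, trim_loss, lengths,
                 holding_coef=0.01, backorder_cost=1.0,
                 max_inventory=70, max_stock=30):
        self.patterns = np.asarray(patterns)      # shape (7 items, 15 patterns), Table 2
        self.trim_loss = np.asarray(trim_loss)    # trim loss per pattern (cm), Table 2
        self.lengths = np.asarray(lengths)        # item lengths (cm), Table 1
        self.holding_coef = holding_coef          # holding cost coefficient (0.01 per Table 4)
        self.backorder_cost = backorder_cost      # placeholder: value not given in the extract
        self.max_inventory = max_inventory        # maximum inventory per item (Table 4)
        self.max_stock = max_stock                # available stock materials per step (Table 4)

    def reset(self):
        # Initial inventory ~ discrete uniform on {0, ..., 35} for each of the 7 items
        self.inventory = np.random.randint(0, 36, size=self.patterns.shape[0])
        return self.inventory.copy()

    def step(self, action, demand):
        # action[j]: number of times pattern j is applied; demand[i]: demand of item i
        action = np.asarray(action)
        produced = self.patterns @ action

        # Constraint checks: the episode ends (done = True) if either constraint is violated
        # (reward on violation is an assumption, not specified in the text).
        if (np.any(self.inventory + produced > self.max_inventory)
                or action.sum() > self.max_stock):
            return self.inventory.copy(), 0.0, True

        # State transition: add production, subtract demand (back-orders appear as negatives)
        self.inventory = self.inventory + produced - demand

        # Cost = trim loss + inventory holding cost + back-order cost, with [x]+ = max(0, x)
        trim = float(self.trim_loss @ action)
        holding = float(np.sum(self.holding_coef * self.lengths
                               * np.maximum(self.inventory, 0)))
        backorder = float(np.sum(self.backorder_cost * np.maximum(-self.inventory, 0)))
        cost = trim + holding + backorder

        # Reward scaled so that a cost of 500 maps to a reward of 1 (assumed form 500/cost)
        reward = 500.0 / max(cost, 1e-6)
        return self.inventory.copy(), reward, False
```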
2.5. Proposed Method
2.5.1. Two-Stage Discount Factor
- Avoiding violating the constraints;
- Minimizing cost.
2.5.2. Proposed Process
- Interaction for data collection (solid line in Figure 3): At the beginning of an episode, the initial state is sampled and observed by the actor, and the actor then outputs the mean value and variance for sampling an action from a Gaussian distribution. After executing a sampled action, the environment returns the next state, reward, and whether the episode is finished. If the episode is finished, indicating that the constraints are violated, then the next episode starts, and an initial state is sampled. Otherwise, the next state is observed by the actor to take the next action.
- Training of the critic and actor (green dashed line in Figure 3): After every 32 steps of interaction between the actor and the environment, the total loss, which is composed of the critic, policy, and entropy losses (Equation (20)), is calculated and used to update both the actor and critic networks. Note that if the episode ends at time step t, the state value of time step t + 1 must be set to zero; otherwise, training will not converge. Furthermore, we propose a two-stage discount factor algorithm, as shown in the orange box in Figure 3: once the average number of steps over the previous hundred training episodes exceeds 200, the discount factor is reduced from 0.9 to 0.1. A minimal sketch of this loss computation is given below.
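As a rough illustration of this update, the sketch below computes an A2C-style total loss over a 32-step batch and zeroes the bootstrap value at terminal steps, as required above. It is not the paper's exact Equation (20): the weighting coefficients `value_coef` and `entropy_coef`, the use of PyTorch, and the tensor interface are assumptions for illustration.

```python
import torch

def a2c_total_loss(values, next_values, log_probs, entropies, rewards, dones,
                   gamma, value_coef=0.5, entropy_coef=0.01):
    """Sketch of a batched A2C total loss: critic + policy + entropy terms.

    values, next_values: critic estimates V(s_t), V(s_{t+1}) over the 32-step batch
    log_probs, entropies: from the Gaussian policy used to sample actions
    rewards, dones: environment feedback; dones[t] = 1 if the episode ended at step t
    gamma: discount factor (0.9, later 0.1 under the two-stage schedule)
    """
    # Zero the bootstrap value for terminal transitions, i.e. V(s_{t+1}) = 0
    # whenever the episode ended at step t (needed for convergence, as noted above).
    targets = rewards + gamma * next_values * (1.0 - dones)

    advantages = targets - values
    critic_loss = advantages.pow(2).mean()
    policy_loss = -(log_probs * advantages.detach()).mean()
    entropy_loss = -entropies.mean()   # encourages exploration

    # Assumed weighting; the paper's Equation (20) may combine the terms differently.
    return value_coef * critic_loss + policy_loss + entropy_coef * entropy_loss
```

In a training loop, this loss would be backpropagated through both networks after every 32 environment steps, matching the batch size B = 32 used in Algorithm 1.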
Algorithm 1. Proposed A2C Model for a Continuous Action Space with a Two-Stage Discount Factor
Initialize the actor network πθ and the critic network Qθ with random parameters θ
Input: learning rate = 1 × 10−4, batch size B = 32
Initialize the discount factor γ = 0.9
Initialize the memory M
Initialize the total number of game steps, the total reward, and the total cost to 0
for each episode:
  Initialize the initial state s0
  while the episode is not finished:
    Get μt, σ2t = πθ(st)
    Take an action at = int(sample from a Gaussian distribution with mean μt and variance σ2t)
    Execute action at and observe the reward rt, the next state st+1, and donet
    Store (st, at, rt, st+1, donet) in M
    Increment the total number of game steps and accumulate the total reward and total cost
    Update the state st ← st+1
    if donet:
      Calculate the average number of steps and the mean cost over the last 100 episodes
      if the average number of steps > 200:
        γ = 0.1
    if the number of samples in M = B:
      Calculate the total loss by Equation (20)
      Update the critic and the actor by minimizing the total loss
end for
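The two-stage discount factor in Algorithm 1 can be isolated as a small scheduling helper, as in the sketch below. The 0.9 and 0.1 values, the 200-step threshold, and the 100-episode window come from the text; the class name `TwoStageDiscount` and its interface are illustrative assumptions.

```python
from collections import deque

class TwoStageDiscount:
    """Switches the discount factor from 0.9 to 0.1 once episodes become long enough."""

    def __init__(self, gamma_initial=0.9, gamma_final=0.1,
                 step_threshold=200, window=100):
        self.gamma = gamma_initial
        self.gamma_final = gamma_final
        self.step_threshold = step_threshold
        self.episode_lengths = deque(maxlen=window)   # lengths of the last 100 episodes

    def end_of_episode(self, episode_steps):
        # Record the finished episode and switch gamma once the average number
        # of steps over the last 100 episodes exceeds the threshold.
        self.episode_lengths.append(episode_steps)
        avg_steps = sum(self.episode_lengths) / len(self.episode_lengths)
        if avg_steps > self.step_threshold:
            self.gamma = self.gamma_final
        return self.gamma
```

The intent of the schedule, following the two objectives listed in Section 2.5.1, is that the first stage (γ = 0.9) rewards long, constraint-satisfying trajectories, while the second stage (γ = 0.1) shifts the emphasis toward minimizing the immediate cost once the agent reliably avoids constraint violations.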
3. Results and Discussion
3.1. Training
3.1.1. Average Number of Steps
3.1.2. Mean Cost
3.2. Testing
- Ability to satisfy the constraints: The results for the average number of steps demonstrate that the proposed model satisfies the constraints well, without the excessive random resampling used in the literature to force actions to satisfy the constraints.
- Repeatability, practicality, and short training time: Our method has a small number of hyperparameters to be tuned, which makes it repeatable, practical, and easy to train. Furthermore, the proposed two-stage discount factor algorithm can reduce training time.
- Intelligibility: The proposed two-stage discount factor algorithm dynamically adjusts a single hyperparameter of the A2C model, and because the proposed model builds on a common RL method, it remains simple to understand and implement.
4. Conclusions
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
References
- Gilmore, P.C.; Gomory, R.E. A linear programming approach to the cutting-stock problem. Oper. Res. 1961, 9, 849–859.
- Israni, S.; Sanders, J. Two-dimensional cutting stock problem research: A review and a new rectangular layout algorithm. J. Manuf. Syst. 1982, 1, 169–182.
- Cheng, C.H.; Feiring, B.R.; Cheng, T.C.E. The cutting stock problem—A survey. Int. J. Prod. Econ. 1994, 36, 291–305.
- Krichagina, E.V.; Rubio, R.; Taksar, M.I.; Wein, L.M. A dynamic stochastic stock-cutting problem. Oper. Res. 1998, 46, 690–701.
- Alem, D.J.; Munari, P.A.; Arenales, M.N.; Ferreira, P.A.V. On the cutting stock problem under stochastic demand. Ann. Oper. Res. 2010, 179, 169–186.
- Ikonen, T.J.; Heljanko, K.; Harjunkoski, I. Reinforcement learning of adaptive online rescheduling timing and computing time allocation. Comput. Chem. Eng. 2020, 141, 106994.
- Pitombeira-Neto, A.R.; Murta, A.H. A reinforcement learning approach to the stochastic cutting stock problem. EURO J. Comput. Optim. 2022, 10, 100027.
- Gu, S.; Hao, T.; Yao, H. A pointer network based deep learning algorithm for unconstrained binary quadratic programming problem. Neurocomputing 2020, 390, 1–11.
- Sur, G.; Ryu, S.Y.; Kim, J.; Lim, H. A deep reinforcement learning-based scheme for solving multiple knapsack problems. Appl. Sci. 2022, 12, 3068.
- Hubbs, C.D.; Li, C.; Sahinidis, N.V.; Grossmann, I.E.; Wassick, J.M. A deep reinforcement learning approach for chemical production scheduling. Comput. Chem. Eng. 2020, 141, 106982.
- Kaelbling, L.P.; Littman, M.L.; Moore, A.W. Reinforcement learning: A survey. J. Artif. Intell. Res. 1996, 4, 237–285.
- Zhu, L.; Cui, Y.; Takami, G.; Kanokogi, H.; Matsubara, T. Scalable reinforcement learning for plant-wide control of vinyl acetate monomer process. Control Eng. Pract. 2020, 97, 104331.
- Shao, Z.; Si, F.; Kudenko, D.; Wang, P.; Tong, X. Predictive scheduling of wet flue gas desulfurization system based on reinforcement learning. Comput. Chem. Eng. 2020, 141, 107000.
- Arulkumaran, K.; Deisenroth, M.P.; Brundage, M.; Bharath, A.A. Deep reinforcement learning: A brief survey. IEEE Signal Process. Mag. 2017, 34, 26–38.
- Peng, B.; Li, X.; Gao, J.; Liu, J.; Chen, Y.N.; Wong, K.F. Adversarial advantage actor-critic model for task-completion dialogue policy learning. In Proceedings of the 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Calgary, AB, Canada, 15–20 April 2018; pp. 6149–6153.
- Mnih, V.; Badia, A.P.; Mirza, M.; Graves, A.; Lillicrap, T.; Harley, T.; Silver, D.; Kavukcuoglu, K. Asynchronous methods for deep reinforcement learning. In Proceedings of the 33rd International Conference on Machine Learning (ICML), New York, NY, USA, 19–24 June 2016; pp. 1928–1937.
Item | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
---|---|---|---|---|---|---|---|
Length (cm) | 115 | 180 | 267 | 314 | 880 | 1180 | 1200 |
Item\Pattern | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | 10 | 13 | 3 | 3 | 2 | 2 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
2 | 0 | 0 | 1 | 1 | 2 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 2 | 2 | 3 | 0 |
4 | 1 | 0 | 0 | 3 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 3 | 2 | 4 |
5 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 |
6 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
7 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
Trim loss (cm) | 36 | 5 | 95 | 33 | 30 | 70 | 5 | 25 | 33 | 53 | 39 | 86 | 24 | 71 | 64 |
Item | 1 | 2 | 3 | 4 | 5 | 6 | 7 | dmin | dmax |
---|---|---|---|---|---|---|---|---|---|
Probability (p) | 0.3 | 0.2 | 0.2 | 0.1 | 0.1 | 0.05 | 0.05 | 40 | 50 |
Parameter | Value | Statement
---|---|---
Inventory holding cost | 0.01 li | Inventory holding cost per item, where li is the length of item i (Table 1).
Back-order cost | | Back-order cost per item, which is the cost of not satisfying the demand.
Maximum inventory | 70 | Maximum inventory for each item at one time.
Available stock materials | 30 | Number of available stock materials at one time.
Discount Factor Setting | Average Number of Steps | Mean Cost
---|---|---
Two-stage discount factor | 24,419 | 462.75 |
Discount factor = 0.9 | 342 | 1263.45 |
Discount factor = 0.1 | 24 | 986.5 |