Increasing the Energy-Efficiency in Vacuum-Based Package Handling Using Deep Q-Learning

Abstract: Billions of packages are automatically handled in warehouses every year. The gripping systems are, however, most often oversized in order to cover a large range of different carton types, package masses, and robot motions. In addition, a targeted optimization of the process parameters with the aim of reducing the oversizing requires prior knowledge, personnel resources, and experience. This paper investigates whether the energy-efficiency in vacuum-based package handling can be increased without the need for prior knowledge of optimal process parameters. The core method comprises the variation of the input pressure for the vacuum ejector in accordance with the robot trajectory and the resulting inertial forces at the gripper-object-interface. The control mechanism is trained by applying reinforcement learning with a deep Q-agent. In the proposed use case, the energy-efficiency can be increased by up to 70% within a few hours of learning. It is also demonstrated that generalization with regard to multiple different robot trajectories is achievable. In the future, the industrial applicability can be enhanced by deploying the deep Q-agent in a decentralized system, to collect data from different pick and place processes and enable a generalizable and scalable solution for energy-efficient vacuum-based handling in warehouse automation.


Introduction
Vacuum-based handling is used in a large variety of applications, especially when high flexibility is required due to the diversity of objects that must be grasped, e.g., in packaging and warehouse logistics. These fields of application are constantly gaining relevance, as global retail e-commerce sales amounted to 4.3 trillion US dollars in 2020 and revenues are estimated to grow to 6.4 trillion dollars by 2024 [1]. Current vacuum-based gripping systems for package handling are mostly realized by means of compressed air-supplied vacuum ejectors and therefore exhibit a highly dynamic and wear-free operation. However, using vacuum ejectors causes enormous energy losses [2,3], since only a few percent of the initially provided electrical energy can be utilized for the handling process (Figure 1). Hence, it is crucial to design the gripping system and the corresponding process parameters in compliance with the application-specific requirements such as the expected robot trajectories and the properties of the objects to be handled. Since handling tasks are not value-adding [4], in industrial practice, it is crucial to set up the system and process quickly. In particular, in the case of a large spectrum of objects to be handled, it is usually not economically feasible to adjust the process parameters for each specific trajectory and object. Hence, in practice, a universally applicable standard system is set up which is oversized for most of the expected objects, but will ultimately provide a robust handling process.
These standard systems are normally dimensioned in accordance with basic calculation schemes, under consideration of the most prevalent load case [5]. Based on the maximum expected load, aggregated by gravitational and inertial forces, the required number and size of vacuum grippers are selected, as well as the necessary pressure difference and a sufficiently powerful ejector type. Due to manifold uncertainties such as environmental conditions (temperature, humidity, contamination), exact object properties (carton composition, mass) and gripper behavior (deformation, sealing capabilities), a certain safety margin is finally applied in order to oversize the calculated system and process parameters. Once these parameters are set, the possibility of online adjustments during the running process is limited to variation of the input pressure of compressed air, which the vacuum ejector is supplied with. However, it would require a disproportionately high effort to analyze every possible load scenario by hand and prior to getting the handling process running.
Figure 1. Energy conversion losses in compressed air-based vacuum generation [2].
Aiming at improving the energy-efficiency in vacuum-based handling, several papers can be found on the development of improved or novel vacuum grippers. The most common approaches focus on the integration of shape memory alloys in order to actively control the adaptation capabilities of single suction cups [6–9]; other researchers develop biomimetic vacuum grippers [10], origami-inspired [11], or electrically actuated grippers [12,13]. Extensive work can be found on mathematical modeling of vacuum grippers with the objective of a more precise system and/or process dimensioning. Basic static modeling of vacuum grippers is conducted in [11,13–21]; dynamic model approaches are presented in [2,22–24]. A few publications focus on finite element analysis and, based on that, design optimization of vacuum grippers [25,26]. Another field concentrates on finding the optimal grasp points of multiple vacuum grippers on the part surface [27–30]. With regard to fluid dynamics and acoustics, several publications on optimized vacuum ejectors and enhanced air-saving functionalities are present as well [31–33]. The majority of the related work presented here focuses on grasping air-impermeable objects and is therefore not directly applicable to package handling. In the context of carton or package handling, vacuum grippers are rather utilized as supportive elements in flexible gripping systems for depalletizing [34,35]. Methods specializing in improving the energy-efficiency for vacuum-based handling of air-permeable objects such as packages or textiles can hardly be found in the literature. The application of machine learning methods in vacuum-based handling is likewise fairly limited. In [36], an adaptive vacuum monitoring and control system for a passive vacuum generation mechanism is realized by means of a deep Q-agent (DQA). Mahler et al. predict the robustness of a vacuum suction grasp via an analytic model and a deep Q-learning (DQL) algorithm [28]. With regard to bin picking, however, a huge body of research is present.
Based on image data, the most feasible grasp pose is estimated in order to pick objects from a bin [37–41]. Clearly, the application of reinforcement learning (RL) in vacuum-based handling has already shown great potential for technological improvements. In summary, there are a number of methods for increasing energy efficiency in vacuum-based handling, but gripping air-permeable objects remains a major challenge that can be addressed using DQL. Therefore, the aim of this paper is to investigate whether the energy-efficiency in vacuum-based package handling can be increased without the need for prior knowledge of optimal process parameters. The core idea is to realize the adaptive variation of the input pressure for the vacuum ejector in accordance with the robot trajectory and the resulting inertial forces at the gripper-object-interface, based on a DQL approach. With this approach, extensive prior knowledge about process-specific parameters that are advantageous for both a robust and an efficient handling process becomes unnecessary. In several subsequent training episodes, a deep Q-agent learns to predict the impact of a certain pressure profile in combination with a specific robot trajectory (the object is not varied). This paper examines how fast and to what extent the DQA learns to improve the energy efficiency and how well it is able to generalize with regard to different trajectories. It is demonstrated that energy savings of 50% to almost 70% can be achieved within an hour of iterative experiments. It is also shown that a good generalization capability of the DQA with regard to multiple different robot trajectories can be achieved.
The paper is structured as follows. Section 2 discusses materials and methods, i.e., the utilized experimental setup as well as the underlying method and the implementation of the proposed DQA approach. The results of the conducted experimental case study are demonstrated in Section 3. Finally, in Section 4, the results are discussed with regard to industrial applicability and scalability and conclusions for further research are drawn.

Materials and Methods
In this section, the regarded use case is firstly presented and the corresponding experimental setup is introduced. Subsequently, the conceptual design and implementation of the deep Q-agent is elaborated in combination with the process control system.

Use Case Definition for Vacuum-Based Package Handling and Experimental Setup
An industrially relevant use case was initially defined in order to create a realistic set of process requirements in accordance with typical package handling processes in warehouse automation. In addition, the transferability of the obtained findings into industrial applications can thus be estimated. In the scope of the underlying research project BiVaS, an experimental robot-based setup is available at Open Hybrid LabFactory (OHLF) in Wolfsburg, Germany. This robot-based setup (Figure 2a) is supplied with centrally generated compressed air and mainly comprises a vacuum ejector, a proportional valve for variation of the input pressure, a gripping system, and a distance sensor for detection of object presence. The gripping system was supplied by J. Schmalz GmbH and consists of four identical vacuum grippers of type SPB1 60 ED-65. The process control is realized as soft PLC on a control PC (Beckhoff TwinCAT 3). Figure 2b shows the target positions of the realized use case. For the robot movement from start to end position, a maximum duration of 2.5 s was defined in accordance with typical industrial handling applications with about 10-15 picks per minute (information provided by J. Schmalz GmbH, Glatten, Germany).
The straightforward approach to save energy in this specified use case is to reduce the input pressure via the proportional valve and therefore decrease the compressed air consumption. If the pressure is set too low, the package may fall off the gripping system. In order to evaluate whether the package has lost contact with the grippers, a distance sensor is integrated into the gripping system (Figure 3a). If a certain threshold value for the measured distance is exceeded, the package is considered to have fallen off. In order to enable fully automated tests, the package was fixed to the gripping system with belts (Figure 3b). This makes it possible for the robot to carry the package back to the start position and restart the handling process, regardless of the process success.
Based on the process specification (Figure 2b), 18 different trajectories were created in order to generate multiple diverse acceleration profiles in the X- and Z-components (in the scope of this work, the robot path is regarded as two-dimensional). Specifically, six different robot paths were created by first defining a spline-based path with subsequent variation of the support points, and each path was then executed at three different speeds. Figure 4 shows the created paths in X- and Z-components for all 18 trajectories T1 to T18. The objective of the experimental case study is to evaluate, as a first step, whether a trained DQA is able to increase the energy efficiency of this handling setup. As a second step, the quality of the DQA is analyzed with respect to its ability to reduce the energy consumption for each separate trajectory and, in comparison, how well it is capable of generalizing and transferring the learned mechanism to unknown trajectories.

Deep Q-Learning Implementation and Process Control Architecture
Reinforcement Learning (RL) is one of three main categories of Machine Learning, besides Supervised Learning, which is primarily applied for classification and regression, and Unsupervised Learning, most often used for cluster recognition and data compression [43]. RL aims to train an agent to make decisions in order to reach a defined objective by maximizing the reward that depends on the outcome of each episode (and each step inside one episode). The agent interacts with the environment by performing actions based on observations of states. At each time step $t$, the system is in a state $s_t$. Originating from this state, the agent picks an action $a_t$ from the space of possible actions $A$. This action leads to a new state $s_{t+1}$, which is associated with a reward $r_{t+1}$. Over a given number of episodes, the agent learns a strategy for deciding on the optimal state-specific action, called the policy. The core idea of RL is based on describing the regarded optimization problem as a Markov Decision Process (MDP), which assumes a finite number of states and actions, where each subsequent state depends only on the current state and the selected action [44]. One established approach to solve the MDP problem is Q-learning, which is a model-free method based on Temporal Difference Learning. Q-learning offers a high sample efficiency, which is particularly advantageous for practical experiments [45]. The objective of Q-learning is to learn which action will yield the highest reward, depending on the state. The optimal Q-value $Q^*$ is hereby calculated by Equation (1):

$$Q^*(s_t, a_t) = R(s_t, a_t) + \gamma \max_{a \in A} Q^*(s_{t+1}, a) \tag{1}$$

$R$ is the expected total reward for the current state-action pair $(s_t, a_t)$. In addition to $R$, the maximum expectable Q-value of all possible next state-action pairs is estimated and discounted over time by the discount factor $\gamma$. In each training episode, the Q-learning algorithm updates the approximation of the Q-function with Equation (2), where $\alpha$ is the learning rate:

$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_{t+1} + \gamma \max_{a \in A} Q(s_{t+1}, a) - Q(s_t, a_t) \right] \tag{2}$$
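To make the update rule described above concrete, the following minimal sketch applies one temporal-difference Q-learning step to a tabular Q-function. It is an illustration only and not the implementation used in this work: the dictionary-based table, the state labels, and the values of `alpha` and `gamma` are assumptions chosen for readability.

```python
from collections import defaultdict


def q_update(q, state, action, reward, next_state, actions, alpha=0.1, gamma=0.9):
    """Apply one Q-learning update in place and return the new Q-value.

    alpha (learning rate) and gamma (discount factor) are illustrative
    defaults, not values taken from the paper.
    """
    # Best Q-value reachable from the next state over all possible actions.
    best_next = max(q[(next_state, a)] for a in actions)
    # Temporal-difference target: immediate reward plus discounted lookahead.
    td_target = reward + gamma * best_next
    # Move the current estimate a fraction alpha towards the target.
    q[(state, action)] += alpha * (td_target - q[(state, action)])
    return q[(state, action)]


# Hypothetical toy example with two actions and string-labelled states.
q_table = defaultdict(float)
actions = [0, 1]
new_value = q_update(q_table, "s0", 1, reward=1.0, next_state="s1", actions=actions)
print(new_value)
```

In a deep Q-agent, the table is replaced by a neural network that approximates the Q-values, but the target construction follows the same pattern.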
Figure 5 visualizes the implementation of the DQA approach for the use case specified above. The input for the DQA, the state $s_t$, is defined by the acceleration profile. With the state as input, the deep Q-agent predicts the Q-values and selects the best action, where the actions represent discrete pressure levels. For the execution of the handling process, the process control system (PLC) sets the pressure levels at the proportional valve at pre-defined times. This timed pressure control supplies the vacuum ejector with (ideally) the exact amount of pressurized air that is needed for a robust but efficient handling process, in order to counter the load occurring at the gripper-object-interface (GOI) with an adequately high holding force. The efficiency reward is computed from the measured air consumption and combined with the cycle reward (success or failure) into the aggregated reward $r_t$. This allows the agent to be trained with the newly generated dataset of state, action, and reward. The DQA (two fully connected hidden layers of 24 neurons) is implemented by means of the Keras framework and trained with a learning rate of 0.01.
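The mapping from a state to a pressure level can be sketched as a forward pass through a small fully connected network followed by a greedy selection. The sketch below uses plain Python instead of Keras so that it runs without dependencies. Several details are assumptions that the text does not specify: 24 neurons in each hidden layer, ReLU activations, a linear output layer with one Q-value per pressure level, and random untrained weights.

```python
import random

N_INPUTS, N_HIDDEN, N_ACTIONS = 3, 24, 25   # 3 acceleration features, 25 pressure levels
PRESSURE_STEP_BAR = 0.25


def init_layer(rng, n_in, n_out):
    """Return (weights, biases); weights[j] holds the inputs of output unit j."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def dense(x, layer, relu):
    """Fully connected layer, optionally followed by ReLU (assumed activation)."""
    weights, biases = layer
    out = [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, biases)]
    return [max(0.0, v) for v in out] if relu else out


def q_values(state, layers):
    """Forward pass: two hidden layers, then one linear Q-value per action."""
    hidden = dense(state, layers[0], relu=True)
    hidden = dense(hidden, layers[1], relu=True)
    return dense(hidden, layers[2], relu=False)


def greedy_pressure(state, layers):
    """Pick the action with the highest Q-value and convert it to bar."""
    q = q_values(state, layers)
    best_action = max(range(len(q)), key=q.__getitem__)
    return best_action * PRESSURE_STEP_BAR


rng = random.Random(0)   # untrained, randomly initialised weights
layers = [
    init_layer(rng, N_INPUTS, N_HIDDEN),
    init_layer(rng, N_HIDDEN, N_HIDDEN),
    init_layer(rng, N_HIDDEN, N_ACTIONS),
]
state = [3.2, 1.5, 0.8]  # hypothetical state vector in m/s^2
print(len(q_values(state, layers)), greedy_pressure(state, layers))
```

With untrained weights the selected pressure is arbitrary; the sketch only shows the data flow from state to action.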
To condense the robot trajectory information into a state representation that is usable for the DQA, the respective trajectory must be evaluated with regard to the most relevant acceleration values. Figure 6 presents a method for feature extraction from acceleration profiles in pick and place processes. In general, typical pick and place trajectories mainly consist of a motion in Z which goes up initially, when an object is picked, and goes down in the end, when the object is placed. The resulting path roughly follows the form of a parabola (Figure 6 top, dashed line). The motion in X follows the shape of a sigmoid curve, initially accelerating to a constant velocity and decelerating in the end (Figure 6 top, dotted line). Since these motion patterns can be assumed to occur in most pick and place processes, as shown qualitatively in Figure 6, it is possible to derive a generally applicable method for identifying the time windows where the input pressure should be varied. Regarding the resulting acceleration profiles of the X- and Z-motion, three pairs of points of interest (POI) can be defined (marked green in Figure 6, bottom) that are relevant for choosing the appropriate pressure level in accordance with common load cases (normal load, transversal load). This results in three process phases, where different pressure levels can be set.
The first POI is associated with the initial acceleration of the robot, where a positive acceleration occurs in both the X- and Z-component. The second POI is located at the transition from positive to negative acceleration in X. In this area, the Z-acceleration becomes zero. Hence, the input pressure (and therefore the holding force) can potentially be reduced. To decide at what time the pressure may be reduced after the initial acceleration, it is first necessary to assess what residual acceleration $a_R$ is permissible. This defines the duration of Phase 1 (blue left-headed arrow in Figure 6). The third POI covers the maximum deceleration of the X-motion and the deceleration of Z when approaching the target position for deposition of the object. Accordingly, based on the value of the initial acceleration $a_I$ at the point when the input pressure should be increased again, the duration of Phase 2 can be determined. Finally, the duration of Phase 3 is set as well, since it lasts until the end of the process.
For the application of Q-learning to the above-defined problem, it is required to define finite state and action spaces. Hence, discrete acceleration features were derived from each of the designed robot trajectories. For each of the defined phases, the maximum acceleration values for X, -Z, and +Z were determined and rounded to one decimal place. One set of these three acceleration values composes one state $s_t = [a_{x,max}, a_{+z,max}, a_{-z,max}]$. The maximum obtained acceleration values amounted to $a_{x,max}$ = 10.0 m/s², $a_{+z,max}$ = 6.2 m/s², and $a_{-z,max}$ = 5.2 m/s², which results in an overall space of 100 × 62 × 52 = 322,400 possible three-dimensional states in the specified use case scenario (however, not all of these states are physically possible).
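The discretization of a phase into a state vector can be sketched as follows. The sampled acceleration values and the phase boundaries are hypothetical, and the use of the acceleration magnitude for the X-component is an assumption, since the text only states that maximum values were determined and rounded to one decimal place.

```python
def phase_state(ax, az, start, end):
    """Return [a_x,max, a_+z,max, a_-z,max] for samples start..end-1.

    Values are rounded to 0.1 m/s^2. Taking the magnitude of the X
    acceleration is an assumption made for this sketch.
    """
    ax_max = max(abs(a) for a in ax[start:end])
    az_pos = max(max(a, 0.0) for a in az[start:end])    # largest upward acceleration
    az_neg = max(max(-a, 0.0) for a in az[start:end])   # largest downward acceleration
    return [round(ax_max, 1), round(az_pos, 1), round(az_neg, 1)]


# Hypothetical sampled accelerations (m/s^2) and phase boundaries (sample indices).
ax = [0.0, 2.34, 4.96, 1.0]
az = [1.26, 3.0, -0.44, -2.07]
phase_bounds = [(0, 2), (2, 4)]
states = [phase_state(ax, az, s, e) for s, e in phase_bounds]
print(states)

# Size of the discretized state space quoted in the text (0.1 m/s^2 resolution).
STATE_SPACE_SIZE = 100 * 62 * 52
print(STATE_SPACE_SIZE)
```

Rounding to one decimal place keeps the state space finite, which is what the Q-learning formulation requires.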
In each of the three phases, based on the specific state, the agent selects an input pressure between 0.0 and 6.0 bar, in steps of 0.25 bar. Thus, the action space contains 25 possible actions. For each episode (one complete handling process from start to end position), based on the phase-specific states, three actions are composed offline into one set and then executed. In the training process, the resulting efficiency rewards can be directly associated with each phase. The efficiency reward is calculated by dividing the compressed air saving, i.e., the difference between the reference air consumption $c_r$ and the measured compressed air consumption $c_m$, by the reference air consumption $c_r$, which was previously determined for each trajectory at a permanent pressure level of 6.0 bar. This reference case originates from industrial practice, since in most cases a compressed air network is already present in the factory and supplies the air at 6-8 bar. For the first two phases, the reward is calculated as follows:

$$r = \frac{c_r - c_m}{c_r}$$

For the last phase, it is additionally considered whether the handling process has been completed properly: the efficiency reward is combined with the cycle reward $r_C$, which is set to 1 in case of success and to -10 in case of failure. This makes it possible to consider the previous energy savings not in isolation but also with regard to their contribution to a process failure. If the package is lost in Phase 1 or Phase 2, the reward is directly set to zero for the respective phase. The state representations that were determined experimentally from the created trajectories with the method for extraction of the acceleration features introduced in Figure 6, as well as the corresponding reference air consumption values for each trajectory (given in liters), are summarized in Table 1.
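The action space and the phase rewards described above can be sketched in a few lines. The efficiency term follows the verbal definition in the text. How the cycle reward enters the Phase 3 reward is not recoverable here, so the additive combination below is an explicit assumption, as are the function names and the example consumption values.

```python
# 25 discrete pressure levels from 0.0 to 6.0 bar in steps of 0.25 bar.
PRESSURE_LEVELS_BAR = [i * 0.25 for i in range(25)]


def efficiency_reward(c_ref, c_meas):
    """Relative compressed air saving with respect to the 6.0 bar reference."""
    return (c_ref - c_meas) / c_ref


def phase_reward(phase, c_ref, c_meas, package_lost):
    """Reward for one phase (1, 2 or 3) of a handling episode."""
    if phase in (1, 2):
        # A package lost in Phase 1 or 2 sets the reward of that phase to zero.
        return 0.0 if package_lost else efficiency_reward(c_ref, c_meas)
    # Cycle reward: +1 for a completed process, -10 for a failure.
    r_cycle = -10.0 if package_lost else 1.0
    # ASSUMPTION: efficiency and cycle reward are simply added in Phase 3.
    return efficiency_reward(c_ref, c_meas) + r_cycle


# Hypothetical consumption values in liters.
print(phase_reward(1, c_ref=10.0, c_meas=4.0, package_lost=False))
print(phase_reward(3, c_ref=10.0, c_meas=5.0, package_lost=True))
```

Under this additive assumption, an episode that saves about two thirds of the reference air in every phase and completes successfully would sum to a reward of roughly 3.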

Design and Conduction of Experiments
Two experimental studies were designed and conducted in order to evaluate the capability of the DQA to reduce the energy consumption of the introduced use case. The first objective is to train a separate agent for each trajectory. Hence, all 18 trajectories were implemented as subprograms in the robot control system and then executed in a main program accessed via indices. For the first experiment, 700 repetitions were planned for each trajectory. The indices were shuffled for randomization. In total, 12,600 repetitions were conducted over about 34 h. The ε-greedy algorithm with an exploration decay rate of ε = 0.999 was applied to ensure sufficiently strong initial exploration. In contrast to the purely greedy algorithm, ε-greedy balances exploration and exploitation by means of the exploration decay rate: with a certain probability, a set of random actions is selected that differs from the action set selected by the DQA. This randomness decays over time, since in the n-th episode the probability of a random action set amounts to ε^n (bounded below by ε_min = 0.01).
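The decaying exploration schedule can be illustrated with a short sketch. The decay rate and the lower bound are taken from the text; the function names, the representation of an action set as a list of three action indices, and the resampling loop that enforces a random set different from the greedy one are assumptions.

```python
import random

EPS_DECAY = 0.999   # exploration decay rate from the text
EPS_MIN = 0.01      # lower bound of the exploration probability


def exploration_probability(episode):
    """Probability of choosing a random action set in the given episode."""
    return max(EPS_MIN, EPS_DECAY ** episode)


def select_action_set(greedy_set, n_actions, episode, rng):
    """Return either the greedy action set or a different, random one."""
    if rng.random() < exploration_probability(episode):
        candidate = list(greedy_set)
        # Resample until the random set differs from the greedy set.
        while candidate == list(greedy_set):
            candidate = [rng.randrange(n_actions) for _ in greedy_set]
        return candidate
    return list(greedy_set)


rng = random.Random(42)
for n in (0, 700, 5000):
    print(n, round(exploration_probability(n), 4))
print(select_action_set([24, 12, 20], n_actions=25, episode=0, rng=rng))
```

With these values, the exploration probability is still close to 0.5 after 700 episodes and reaches its lower bound only after several thousand episodes.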
The second experiment aims to evaluate the generalization capabilities of the DQA. The underlying idea is to train the DQA with data from 15 out of 18 trajectories (training data) in a total of 10,000 repetitions, again with ε = 0.999. Subsequently, the trained neural network is re-used as a pre-trained DQA, applied to the remaining three trajectories (test data), which are so far unknown to the DQA, and further trained with the newly acquired data. For these 2000 additional repetitions, the exploration decay rate is set to ε = 0. For this second experiment, a 6-fold cross validation was planned in order to compare the generalization results for different combinations of training and test data sets. Since the trajectories were generated by varying six different paths by three different values for the program override, three categories of trajectories are available: slow, medium, and fast. When selecting indices for the cross validation, it was ensured that the test indices contain exactly one trajectory of each category. Table 2 assigns the trajectories to the respective categories. Accordingly, the training and test indices were selected for the 6-fold cross validation. For each of the six folds, lists of 10,000 slots were filled by random selection from the respective training indices (see Table 3), and analogously for the test indices. For each fold, the experiments took about 33 h in total (12,000 repetitions at ~360 repetitions/h).
Table 3. Training and test indices for 6-fold cross validation of the second experiment.
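The construction of such stratified folds and of the randomized execution lists can be sketched as follows. The assignment of trajectory indices to the categories slow, medium, and fast is purely hypothetical here, because the contents of Tables 2 and 3 are not reproduced in this text; the function names and the random seed are assumptions as well.

```python
import random


def make_folds(categories, rng):
    """Build folds whose test set holds exactly one trajectory per category."""
    shuffled = {name: rng.sample(ids, len(ids)) for name, ids in categories.items()}
    n_folds = len(next(iter(shuffled.values())))
    all_ids = sorted(i for ids in categories.values() for i in ids)
    folds = []
    for k in range(n_folds):
        test = sorted(ids[k] for ids in shuffled.values())
        train = [i for i in all_ids if i not in test]
        folds.append((train, test))
    return folds


def fill_slots(indices, n_slots, rng):
    """Fill an execution list by random selection from the given indices."""
    return [rng.choice(indices) for _ in range(n_slots)]


# HYPOTHETICAL category assignment; the real one is given in Table 2.
categories = {
    "slow": [1, 4, 7, 10, 13, 16],
    "medium": [2, 5, 8, 11, 14, 17],
    "fast": [3, 6, 9, 12, 15, 18],
}
rng = random.Random(0)
folds = make_folds(categories, rng)
train_ids, test_ids = folds[0]
training_schedule = fill_slots(train_ids, 10_000, rng)
test_schedule = fill_slots(test_ids, 2_000, rng)
print(len(folds), len(train_ids), len(test_ids))
```

Each trajectory appears in exactly one test set, so every trajectory is evaluated once as unknown data across the six folds.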

Results
The results of the first experiment are shown in Figure 7. For better visibility, the running mean was computed over a window of 50 data points. For the slow and medium trajectories (columns 1 and 2), the reward quickly approaches a value of 3 within 400 episodes (~1 h). For the fast trajectories (column 3), the results are less clear; in two cases, the reward even decreases. According to the introduced reward function, average energy savings of 50% up to about 70% can be achieved quickly (compare Table 4). In general, two different cases of object loss occur. Firstly, a random object loss occurs when the exploration rate is sufficiently high for the DQA to select an action different from the predicted optimal action. For example, setting the pressure to zero in at least one of the three process phases will eventually lead to an object loss. Secondly, a misprediction can also lead to an object loss. It is noticeable that, especially for the medium trajectories, almost no object losses due to mispredictions occur. Conversely, for trajectories 3, 6, 9, and 12, a high number of mispredictions prevents the agent from converging at all. In the example of trajectory 6, early stopping at approx. episode 380 would have led to a reward of about 2.5. The subsequent training episodes seem to force the agent into overfitting.
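The smoothing and the early-stopping observation above can be illustrated with a short sketch. It assumes a trailing window (the text does not state whether the running mean is trailing or centered), and both function names are hypothetical.

```python
from collections import deque


def running_mean(values, window=50):
    """Trailing mean over the last `window` points (fewer at the start)."""
    buf, total, out = deque(), 0.0, []
    for v in values:
        buf.append(v)
        total += v
        if len(buf) > window:
            total -= buf.popleft()
        out.append(total / len(buf))
    return out


def best_episode(values, window=50):
    """Index of the highest smoothed reward, a candidate early-stopping point.

    Only episodes with a full window are considered when enough data exist.
    Expects a non-empty list of rewards.
    """
    smoothed = running_mean(values, window)
    start = min(window - 1, len(smoothed) - 1)
    return max(range(start, len(smoothed)), key=smoothed.__getitem__)
```

Applied to a logged reward series, `best_episode` returns the episode at which training could have been stopped in hindsight; an online criterion would additionally need a patience threshold.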
The results of the second experiment with 6-fold cross validation are depicted in Figure 8. In general, the reward converges towards a value of 3 during training as the number of object losses due to mispredictions decreases over time. An exception is the experiment associated with fold no. 3, where a significant accumulation of such events occurs around episode 9,000. With the pre-trained neural network, in all folds except no. 4, rewards are quickly reached that are either in the range of the final training reward or even higher (e.g., in fold no. 2). In fold no. 4, several mispredictions lead to a more conservative behavior of the DQA.

Discussion
For both experiments conducted, a simple DQA was set up and trained with the acquired data. Without extensive tuning of hyperparameters, energy savings of 50% (reward of 2.5) to almost 70% (reward of 3.1) were achieved within an hour of iterative experiments for isolated trajectories, particularly for the slow and medium ones. For the fast trajectories, further tuning of the parameters (e.g., learning rate or neural network size) may be required to improve the behavior. The results of the 6-fold cross validation experiments show that, in most of the cases considered, the previously trained trajectories help to quickly find pressure levels that improve the energy-efficiency of the handling processes. In those cases where object losses occur due to mispredictions of the DQA, parameter adaptations are necessary. The agent acts too incautiously in these cases, repeatedly causing insufficient holding forces and therefore object loss. With a more conservative DQA, energy savings of 50% are estimated to be realistic and achievable within a relatively short training time.
The introduced method offers great potential for industrial application. In the scope of this work, the object was not varied and the robot trajectory was analyzed manually. In future applications, however, information such as package weight, carton type, and the planned robot trajectory could be detected automatically, e.g., by scanning a QR code or an RFID chip on the package. The industrial applicability of the DQA could also be made more flexible if the DQA and the control system, including sensors and valve, were deployed in a decentralized system. Independent of the present system setup (e.g., regardless of the exact robot control system and the corresponding programming interface) and of adjacent processes, this decentralized system could work completely self-sufficiently. This would make it possible to collect data from different applications and use cases in order to make the DQA more versatile and more capable of generalization, if needed. Finally, the comprehensive and scalable application of such an intelligent unit can enable immense energy savings in a broad spectrum of industrial pick and place processes.
Data Availability Statement: Publicly available datasets were analyzed in this study. This data can be found here: https://lnk.tu-bs.de/sf7HRt.