Compensating data shortages in manufacturing with monotonicity knowledge

Optimization in engineering requires appropriate models. In this article, a regression method for enhancing the predictive power of a model by exploiting expert knowledge in the form of shape constraints, or more specifically, monotonicity constraints, is presented. Incorporating such information is particularly useful when the available data sets are small or do not cover the entire input space, as is often the case in manufacturing applications. The regression subject to the considered monotonicity constraints is set up as a semi-infinite optimization problem, and an adaptive solution algorithm is proposed. The method is applicable in multiple dimensions and can be extended to more general shape constraints. It is tested and validated on two real-world manufacturing processes, namely laser glass bending and press hardening of sheet metal. It is found that the resulting models both comply well with the expert's monotonicity knowledge and predict the training data accurately. The suggested approach leads to lower root-mean-squared errors than comparative methods from the literature for the sparse data sets considered in this work.


Introduction
Conventional machine learning models are purely data-based. Accordingly, the predictive power of such models is generally poor if the underlying training data D = {(x_l, t_l) : l ∈ {1, …, N}} is insufficient. Unfortunately, such data insufficiencies occur quite often, and they can come in one of the following forms: On the one hand, the available data sets can be too small and have too little variance in the input data points x_1, …, x_N. This problem frequently occurs in manufacturing [55] because varying the process parameters beyond well-tested operating windows is usually costly. On the other hand, the output data t_1, …, t_N can be too noisy.
Aside from potentially insufficient data, however, one often also has additional knowledge about the relation between the input variables and the responses to be learned. Such extra knowledge about the considered process is referred to as expert knowledge in the following. In [26], the interaction of users with a software tool is tracked to capture their expert knowledge in a general form as training data for a classification problem. In [19], expert knowledge is used in the form of a specific algebraic relation between input and output to solve a parameter estimation problem with artificial neural networks. Such informed machine learning [43] techniques beneficially combine expert knowledge and data to build hybrid or gray-box models [54,29,30,8,57,58,3,18], which predict the responses more accurately than purely data-based models. In other words, by using informed machine learning techniques, one can compensate data insufficiencies with expert knowledge.
An important and common type of expert knowledge is prior information about the monotonicity behaviour of the unknown functional relationship x → y(x) to be learned. A lot of concrete application examples with monotonicity knowledge can be found in [1, Sec. 4.1] or [21, Sec. 1], for instance. The present article exclusively deals with regression under such monotonicity requirements. For classification under monotonicity constraints, see e.g. [21,24]. Along with convexity constraints, monotonicity constraints are probably the most intensively studied shape constraints [15] in the literature, and correspondingly there exist plenty of different approaches to incorporate monotonicity knowledge in a machine learning model; see [16] for an extensive overview. Very roughly, these approaches can be categorized according to when the monotonicity knowledge is taken into account: in or only after the training phase. In the terminology of [43], this corresponds to the distinction between knowledge integration in the learning algorithm or in the final hypothesis, respectively.
A lot of methods, especially from the mathematical statistics literature, like [31,27,28,17,10,11,25] incorporate monotonicity knowledge only after training. These articles start with a purely data-based initial model, which in general does not satisfy the monotonicity requirements, and then monotonize this initial model according to a suitable monotonization procedure like projection [31,27,28,25], rearrangement [10,11,6] or tilting [17]. Among other things, it is shown in the mentioned articles that, in spite of noise in the output data, the arising monotonized models are close to the true relationship for sufficiently large training data sets. In summary, these articles show that for large data sets, noise in the output data can be compensated by monotonization to a certain extent.
In contrast to that, in some works like [1,23,7,39,34,16], monotonicity knowledge is incorporated already in training. In these articles, the monotonicity requirements are added as constraints, either hard [16,23,39,34] or soft [1,23], to the data-based optimization of the model parameters. In [39] and [1], probabilistic monotonicity notions are used. In [23,7,39,34], support vector regressors in the linear-programming or the more standard quadratic-programming form, Gaussian process regressors, and neural network models are considered, respectively, and monotonicity of these models is enforced by constraints on the model derivatives at predefined sampling points [23,39,34] or on the model increments between predefined pairs of sampling points [7].
A disadvantage of the projection- and rearrangement-based approaches [10,11,25] from the point of view of manufacturing applications is that these methods are tailored to large data sets. Another disadvantage of these approaches is that the resulting models typically exhibit distinctive kinks, which are almost always unphysical. Also, the models resulting from the multidimensional rearrangement method by [11] are not guaranteed to be monotonic when trained on small data sets. A drawback of the tilting approach from [17] is that it is formulated and validated only for one-dimensional input spaces (intervals in R). Accordingly, naively extending the non-adaptive discretization scheme from [17] to higher dimensions would result in long computation times. A downside of the in-training methods from [23,39,34] is that the sampling points at which the monotonicity constraints are imposed have to be chosen in advance (even though they need not coincide with the training data points). And finally, the method from [16] is limited to multilinear models (that is, linear combinations of monomials x_1^{α_1} ··· x_d^{α_d} where each exponent α_j is either 0 or 1). Such multilinear models are potentially not general and complex enough for various real-world applications.
The method proposed in the present article addresses the aforementioned issues and shortcomings. In Sec. 2, our methodology for monotonic regression using semi-infinite optimization is introduced. It incorporates the monotonicity knowledge already during training. Specifically, polynomial regression models are assumed for the input-output relationships to be learned. Since there is no after-training monotonization step in the method, our models are smooth and, in particular, do not exhibit kinks. Also, due to the employed adaptive discretization scheme, the method is computationally efficient also in higher dimensions. As far as we know, such an adaptive scheme has not been applied to solve monotonic regression problems before, especially not in situations with sparse data. In Sec. 4, the method is validated by means of two applications to real-world processes, which are both introduced in Sec. 3, namely laser glass bending and press hardening of sheet metal. It turns out that the adaptive semi-infinite optimization approach to monotonic regression is better suited for the considered applications with their small data sets, and the resulting models are more accurate than those obtained with the comparative approaches from the literature.

Methods
In this section (more precisely in Secs. 2.1-2.3), our semi-infinite optimization approach to monotonic regression is introduced. In Sec. 2.4, our rather simple method for numerically computing the monotonic projections from [25], which are later compared to the models obtained with our semi-infinite method, is described.

Monotonic regression using semi-infinite optimization
In our approach to monotonic regression, polynomial models

y_w(x) = Σ_{|α| ≤ m} w_α x^α (2.1)

are used for all input-output relationships x → y(x) to be learned. In the above relation (2.1), the sum extends over all d-dimensional multi-indices [13, Sec. 1.1] α = (α_1, …, α_d) ∈ N_0^d with degree |α| := α_1 + ··· + α_d less than or equal to some total degree m ∈ N. The terms x^α := x_1^{α_1} ··· x_d^{α_d} are the monomials in d variables of degree less than or equal to m, and the w_α are the corresponding model parameters to be tuned by regression. Since there are exactly N_m = binom(d+m, d) d-dimensional monomials of degree less than or equal to m, the polynomial regression model (2.1) can be equivalently written as

y_w(x) = Σ_{i=1}^{N_m} w_i φ_i(x) = w^T φ(x), (2.3)

where φ_1, …, φ_{N_m} is an enumeration of these monomials and w ∈ R^{N_m} collects the corresponding parameters. In general, the resulting model x → y_w(x) does not necessarily exhibit the monotonicity behaviour an expert expects for the underlying true physical relationship x → y(x).
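To make the monomial parametrization concrete, the following sketch enumerates the multi-indices, builds the design matrix and solves the unconstrained least-squares fit. The helper names `multi_indices` and `design_matrix` and the toy data are illustrative, not taken from the article's implementation:

```python
import itertools
import math

import numpy as np

def multi_indices(d, m):
    """All d-dimensional multi-indices alpha with |alpha| <= m."""
    return [a for a in itertools.product(range(m + 1), repeat=d)
            if sum(a) <= m]

def design_matrix(X, alphas):
    """Design matrix with entries Phi[l, i] = x_l^{alpha_i}."""
    return np.stack([np.prod(X ** np.asarray(a), axis=1) for a in alphas],
                    axis=1)

d, m = 2, 3
alphas = multi_indices(d, m)
assert len(alphas) == math.comb(d + m, d)    # exactly binom(d+m, d) monomials

rng = np.random.default_rng(0)
X = rng.uniform(size=(25, d))                # toy input data points
t = X[:, 0] + X[:, 1] ** 2                   # toy polynomial relationship
Phi = design_matrix(X, alphas)
w, *_ = np.linalg.lstsq(Phi, t, rcond=None)  # unconstrained least squares
```

Because the toy relationship is itself a polynomial of degree at most m, the unconstrained fit reproduces the training outputs exactly here; with real, noisy or non-polynomial data this is of course not the case.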
In order to enforce the expected monotonicity behaviour, the following constraints on the signs of the partial derivatives ∂_{x_j} y_w(x) are added to the unconstrained standard regression problem (2.4), that is, to the least-squares problem min_{w ∈ R^{N_m}} Σ_{l=1}^N (y_w(x_l) − t_l)^2:

σ_j · ∂_{x_j} y_w(x) ≥ 0 for all j ∈ J and x ∈ X. (2.5)

The numbers σ_j ∈ {−1, 0, 1} indicate the expected monotonicity behaviour for each coordinate direction j ∈ {1, …, d}:
• σ_j = 1 or σ_j = −1 indicates that x → y(x) is expected to be monotonically increasing or decreasing, respectively, in the jth coordinate direction,
• σ_j = 0 indicates that one has no monotonicity knowledge in the jth coordinate direction.
Also, J := {j ∈ {1, …, d} : σ_j ≠ 0} is the set of all directions for which a monotonicity constraint is imposed, and the vector σ := (σ_1, …, σ_d) is referred to as the monotonicity signature of the relationship x → y(x). And finally, X ⊂ R^d is the (continuous) subset of the input space on which the polynomial model (2.1) is supposed to be a reasonable prediction for x → y(x). In this work, X is chosen to be identical with the range covered by the input training data points x_1, …, x_N. I.e., X is the compact hypercuboid set

X = [a_1, b_1] × ··· × [a_d, b_d]

with a_j := min_{l=1,…,N} x_{l,j} and b_j := max_{l=1,…,N} x_{l,j} and with x_{l,j} denoting the jth component of the lth input data point x_l. Writing f(w) := Σ_{l=1}^N (y_w(x_l) − t_l)^2 for the least-squares objective and g_j(w, x) := σ_j · ∂_{x_j} y_w(x) for the constraint functions for brevity, our monotonic regression problem (2.4)-(2.5) takes the neat and simple form

min_{w ∈ R^{N_m}} f(w) s.t. g_j(w, x) ≥ 0 for all j ∈ J and x ∈ X. (2.9)

Since the input set X is continuous and hence contains infinitely many points x, the monotonic regression problem (2.9) features infinitely many inequality constraints. Consequently, (2.9) is a semi-infinite optimization problem [20,36,38,49,50,45] (or more precisely, a standard semi-infinite optimization problem, as opposed to a generalized one). It is well-known that the monotonic regression problem (2.9), just like any other semi-infinite problem, can be equivalently rewritten as a bi-level optimization problem [49,46,9], namely

min_{w ∈ R^{N_m}} f(w) s.t. min_{x ∈ X} g_j(w, x) ≥ 0 for all j ∈ J. (2.10)

Commonly, the overall optimization problem (2.10) is referred to as the upper-level problem of (2.9), while the subproblems in the constraints of (2.10) are called the lower-level problems of (2.9). It is also well-known [49,50] that this innocent-looking reformulation of semi-infinite problems like (2.9) as bi-level problems is the key to both the theoretical and the numerical treatment of such problems.
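As a small illustration of the bi-level viewpoint, the hypercuboid X can be built directly from the data bounds, and the innermost minimization over x can be approximated on a dense grid. All names, data points and the toy model below are made up for the example:

```python
import numpy as np

# Toy input data; X is the hypercuboid [a_1, b_1] x [a_2, b_2] it spans
X_data = np.array([[500.0, 10.0], [600.0, 50.0], [550.0, 30.0]])
a = X_data.min(axis=0)          # a_j = min_l x_{l,j}
b = X_data.max(axis=0)          # b_j = max_l x_{l,j}

def worst_violation(g_j, a, b, n=50):
    """Approximate min over x in X of g_j(x) on an n-per-dimension grid."""
    grids = np.meshgrid(*[np.linspace(lo, hi, n) for lo, hi in zip(a, b)])
    pts = np.stack([g.ravel() for g in grids], axis=1)
    vals = g_j(pts)
    k = np.argmin(vals)
    return pts[k], vals[k]

# Toy model y_w(x) = x_1 * x_2 with sigma_1 = 1: here g_1(w, x) = x_2,
# which is nonnegative everywhere on X, so no constraint is violated.
x_star, v = worst_violation(lambda p: p[:, 1], a, b)
assert v >= 0.0
```

If the lower-level value v were negative, x_star would be a point of maximal monotonicity violation, which is exactly the information the adaptive algorithm of the next section exploits.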

Adaptive solution strategy
In order to solve the semi-infinite monotonic regression problem (2.9), a variant of the adaptive, iterative discretization algorithm by [5] is used. In a nutshell, the idea is the following: the infinite index set X of the original regression problem (2.9) is iteratively replaced by discretizations, that is, finite subsets X_k ⊂ X. These discretizations are adaptively refined from iteration to iteration. In that manner, in every iteration k the ordinary (finite) optimization problem

min_{w ∈ R^{N_m}} f(w) s.t. g_j(w, x) ≥ 0 for all j ∈ J and x ∈ X_k, (2.11)

featuring only finitely many inequality constraints, is obtained; (2.11) is referred to as the kth (discretized) upper-level problem (or, following [5], the kth approximating problem for (2.9)). Each iteration k then consists of two steps, namely an optimization step and an adaptive refinement step. The optimization step computes a solution w^k of the kth upper-level problem (2.11). And the refinement step computes for each direction j ∈ J a point x^{k+1,j} ∈ X at which the jth monotonicity constraint at w = w^k is violated most. I.e., for every j ∈ J an approximate solution x^{k+1,j} of the global optimization problem

min_{x ∈ X} g_j(w^k, x) (2.12)

is computed, which is referred to as the (k, j)th lower-level problem (or, following [5], the (k, j)th auxiliary problem). Then all the points x^{k+1,j} for which a monotonicity violation occurs are added to the current discretization X_k in order to obtain the new discretization X_{k+1}. As soon as no more monotonicity violations occur, iterating is stopped. With regard to the practical implementation of the above solution strategy, it is important to observe that the discretized upper-level problems (2.11) are standard convex quadratic programs [35]. Indeed, inserting (2.3) into the least-squares objective (2.7) and using the design matrix Φ with entries Φ_{li} := φ_i(x_l), one obtains

f(w) = ||Φw − t||_2^2 = w^T Φ^T Φ w − 2 t^T Φ w + t^T t.

Consequently, the objective function of (2.11) is indeed quadratic and convex w.r.t. w. Additionally, since w → ∂_{x_j} y_w(x) is linear for every fixed x, the constraints of (2.11) are indeed linear w.r.t. w.
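The convex-quadratic structure of the discretized upper-level objective can be checked numerically. The sketch below, with random stand-in data rather than a real design matrix, verifies the identity f(w) = ||Φw − t||² = wᵀΦᵀΦw − 2tᵀΦw + tᵀt for an arbitrary parameter vector:

```python
import numpy as np

rng = np.random.default_rng(1)
Phi = rng.normal(size=(25, 10))   # stand-in design matrix
t = rng.normal(size=25)           # stand-in output data
w = rng.normal(size=10)           # arbitrary parameter vector

# f(w) = ||Phi w - t||^2 expanded into explicit convex-quadratic form
H = Phi.T @ Phi                   # positive semi-definite Hessian factor
f_direct = np.sum((Phi @ w - t) ** 2)
f_quadratic = w @ H @ w - 2.0 * (Phi.T @ t) @ w + t @ t
assert np.isclose(f_direct, f_quadratic)
```

This is exactly the form expected by dedicated quadratic-programming solvers such as quadprog, which take the Hessian ΦᵀΦ and the linear term Φᵀt directly.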
With regard to the practical implementation, it is also important to observe that the objective functions x → g_j(w^k, x) of the lower-level problems (2.12) are non-convex polynomials and therefore in general have several local minima. So, (2.12) needs to be solved numerically with a global optimization solver.
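As a minimal illustration of such a lower-level problem, scipy.optimize.shgo (the solver used later in this article) can locate the most violated point of a derivative constraint. The toy cubic model here is made up; its derivative dips below zero inside the unit interval:

```python
import numpy as np
from scipy.optimize import shgo

# Toy model y_w(x) = x^3 - 1.2 x^2 + 0.4 x on X = [0, 1]; its derivative
# 3 x^2 - 2.4 x + 0.4 becomes negative, i.e. the model is not monotone.
dydx = np.polynomial.Polynomial([0.4, -2.4, 3.0])

# Lower-level problem (2.12): globally minimize g(w^k, x) = sigma * y'(x)
res = shgo(lambda x: dydx(x[0]), bounds=[(0.0, 1.0)])
x_viol, g_min = res.x[0], res.fun
assert g_min < 0   # a violation was found; add x_viol to the discretization
```

For this simple derivative the minimizer lies at x = 0.4 with value −0.08; in general the lower-level objective has several local minima, which is precisely why a global solver is needed.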

Algorithm and implementation details
In the following, our adaptive discretization algorithm is described in detail. As has already been pointed out above, it is a variant of the general algorithm developed by [5,Sec. 2], and it is explained after Algorithm 1 how our variant differs from its prototype [5].
1. Initialize: set k = 0 and choose X_0 as a coarse (but non-empty) rectangular grid in X.
2. Solve the kth upper-level problem (2.11) to obtain optimal model parameters w^k ∈ R^{N_m}.
3. Solve the (k, j)th lower-level problem (2.12) approximately for every j ∈ J to find approximate global minimizers x^{k+1,j} ∈ X. Add those of the points x^{k+1,j} for which substantial monotonicity violations occur, i.e., for which g_j(w^k, x^{k+1,j}) < −ε_j, to the current discretization X_k and go to Step 2 with k = k + 1. If for none of the points x^{k+1,j} substantial monotonicity violations occur, go to Step 4.
4. Finalize: check the monotonicity constraints on a fine reference grid X_ref ⊂ X. If a substantial violation g_j(w^k, x) < −ε_j occurs at some x ∈ X_ref, add this point to the discretization and go to Step 2 with k = k + 1; otherwise terminate and return w^k.
In contrast to [5], Algorithm 1 above does not require exact solutions of the (non-convex) lower-level problems. Indeed, Step 3 of Algorithm 1 only requires a solution to be approximated numerically. Therefore, slight constraint violations are tolerated and the finalization step (Step 4) is introduced. Another, but minor, difference compared to the algorithm from [5] is that there are several lower-level problems in each iteration instead of just one, because monotonicity is in general enforced in multiple coordinate directions.
As for the tolerances ε_j (Steps 3 and 4 of Algorithm 1), a monotonicity violation of 1 % of the ranges covered by the input and output training data is allowed for (2.14). The degrees m of the polynomial models in this work are chosen as the largest possible values that do not result in an overfit, because increasing m enhances the model accuracy in general. In this respect, the number of model parameters is allowed to exceed the number of data points (N_m ≥ N), since the constraints represent additional information supplementing the data. As for the reference grid X_ref in the finalization step (Step 4 of Algorithm 1), 20 values per input dimension, equidistantly distributed from the lower to the upper bound along each direction, are used. Algorithm 1 was implemented in Python, and the package sklearn was used for the numerical representation of the models. Since the discretized upper-level problems (2.11) are standard convex quadratic programs, a solver tailored to that specific problem class is used, namely quadprog [14]. It can solve quadratic programs with hundreds of variables and thousands of constraints in just a few seconds because it efficiently exploits the simple structure of the problem. Since, on the other hand, the lower-level problems (2.12) are global optimization problems with possibly several local minima, a suitable global optimization solver is required. The solver scipy.optimize.shgo [12] was used, which employs a simplicial homology strategy and which, in our applications, turned out to be a good compromise between speed and reliability. For the problems considered in this article, shgo's internal local optimization was configured to occur in every iteration, to multi-start from a Sobol set of 100 × d points, and to be executed using the algorithm L-BFGS-B with analytical gradients.
Step 4 in Algorithm 1 ensures that shgo does not miss locations where the monotonicity is not as required.
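A compact one-dimensional sketch of the adaptive scheme may help fix ideas. The toy data are made up, SciPy's SLSQP stands in for quadprog in the discretized QPs, and a dense grid stands in for shgo in the lower-level problems; this is an illustration of the principle, not the article's implementation:

```python
import numpy as np
from scipy.optimize import minimize

# Adaptive discretization in 1D with sigma = +1 (increasing) and m = 5
x = np.array([0.0, 0.25, 0.5, 0.75, 1.0])       # toy inputs
t = np.array([0.0, 0.05, 0.1, 0.8, 0.82])       # toy outputs
m = 5
Phi = np.vander(x, m + 1, increasing=True)      # Phi[l, i] = x_l^i

def dphi(xs):
    """Derivatives of the monomial basis: d/dx x^i = i * x^(i-1)."""
    xs = np.atleast_1d(np.asarray(xs, dtype=float))
    cols = [np.zeros_like(xs)] + [i * xs ** (i - 1) for i in range(1, m + 1)]
    return np.stack(cols, axis=1)

grid = np.linspace(0.0, 1.0, 2001)              # lower-level stand-in grid
X_k = list(np.linspace(0.0, 1.0, 5))            # initial discretization X_0
eps = 1e-6
w_k = np.zeros(m + 1)
for k in range(50):
    D = dphi(np.array(X_k))
    # upper-level problem (2.11): least squares s.t. y'(x) >= 0 on X_k
    res = minimize(lambda w: np.sum((Phi @ w - t) ** 2), w_k,
                   jac=lambda w: 2.0 * Phi.T @ (Phi @ w - t),
                   constraints=[{"type": "ineq",
                                 "fun": lambda w, D=D: D @ w,
                                 "jac": lambda w, D=D: D}],
                   method="SLSQP")
    w_k = res.x
    # lower-level problem (2.12): most violated point on the dense grid
    g = dphi(grid) @ w_k
    j_min = int(np.argmin(g))
    if g[j_min] >= -eps:
        break                                   # no violation left: stop
    X_k.append(grid[j_min])                     # refine and repeat

assert np.min(dphi(grid) @ w_k) >= -1e-3        # monotone up to tolerance
```

The loop mirrors Steps 2 and 3 of Algorithm 1: solve the discretized QP, find the worst violation, refine the discretization, and stop once no substantial violation remains.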

Computing monotonic projections
In order to validate our semi-infinite optimization approach to monotonic regression, it will be compared, among other things, to the projection-based monotonization approach by [25]. As has already been pointed out in Sec. 1, the projection method starts out from a purely data-based initial model y_0 (a Gaussian process regressor in the case of [25]) and then replaces this initial model by the monotonic projection ŷ of y_0. I.e., ŷ : X → R is the monotonic square-integrable function with monotonicity signature σ that is closest to y_0 in the L^2-norm.
In order to numerically compute this monotonic projection ŷ, the original procedure proposed by [25] is not used here, though. Instead, the conceptually simpler methodology by [44] is employed. First, the input space X is discretized with a fine rectangular grid G. Then the corresponding discrete monotonic projection (ŷ(x))_{x∈G}, that is, the solution of the constrained optimization problem

min_{z ∈ R^G} Σ_{x∈G} (z(x) − y_0(x))^2 s.t. σ_j · (z(x + h_j e_j) − z(x)) ≥ 0 for all j ∈ J and all x ∈ G with x + h_j e_j ∈ G, (2.15)

is computed. In the above relation, R^G is the |G|-dimensional vector space of all R-valued functions z = (z(x))_{x∈G} defined on the discrete set G, h_j > 0 indicates the distance of adjacent grid points in the jth coordinate direction, and e_j ∈ R^d is the jth canonical unit vector. It is shown in [44] that the extension of (ŷ(x))_{x∈G} to a grid-constant function on the whole of X is a good approximation of the monotonic projection ŷ, if only the grid is fine enough and the initial model y_0 is continuous, for instance. Since both the objective function and the constraints of (2.15) are convex w.r.t. z, the problem (2.15) is a convex program. cvxopt [2] is used to solve these problems because it offers a sparse matrix type to represent the large coefficient and constraint matrices for d > 1. Alternatively, the discrete monotonic projection problems can also be solved using any of the more sophisticated computational methods by [4, Sec. 2.3]; [42, Sec. 4.1]; [47,48,52,53] or [22]. Yet, for the number of input dimensions considered here, our direct computational method is sufficient.
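In one dimension, the discrete projection problem (2.15) is precisely the classical isotonic regression problem, which can be solved exactly with the pool-adjacent-violators algorithm instead of a general-purpose convex solver. The following is a minimal sketch of that 1D special case, not the cvxopt-based implementation used in this work:

```python
import numpy as np

def monotonic_projection_1d(z):
    """L2 projection of grid values z onto nondecreasing vectors via the
    pool-adjacent-violators algorithm (1D special case of (2.15))."""
    vals, wts, idx = [], [], []
    for i, zi in enumerate(z):
        vals.append(float(zi)); wts.append(1.0); idx.append([i])
        # merge adjacent blocks while the monotonicity is violated
        while len(vals) > 1 and vals[-2] > vals[-1]:
            w = wts[-2] + wts[-1]
            v = (wts[-2] * vals[-2] + wts[-1] * vals[-1]) / w
            ids = idx[-2] + idx[-1]
            vals[-2:], wts[-2:], idx[-2:] = [v], [w], [ids]
    out = np.empty(len(z))
    for v, ids in zip(vals, idx):
        out[ids] = v
    return out

y0 = np.array([1.0, 3.0, 2.0, 2.0, 5.0])   # non-monotone initial values
proj = monotonic_projection_1d(y0)
assert np.all(np.diff(proj) >= 0)          # projection is nondecreasing
```

Non-monotone stretches are replaced by their (weighted) block averages, which is exactly the source of the plateaus and kinks that projection-based models exhibit in the results below.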

Laser glass bending
The first application example is laser glass bending. In the industrial standard process of glass bending [32], a flat glass specimen is placed in a furnace with furnace temperature T_f, and then the heated specimen is bent at a designated edge, driven by gravity. As an additional feature, a laser can be added to the industrial standard process in order to specifically heat up the critical region of the flat glass around the bending edge and thus to speed up the process and achieve smaller bending radii [40,41]. The laser can generally scribe in arbitrary directions. In the process considered here, however, the laser path is restricted to three straight lines parallel to the bending edge. While the middle line defines the bending edge, the outer two lines are at a fixed distance Δl/2 = 5.75 mm on either side of it. The laser spot moves along this path in multiple cycles, with the number of cycles denoted by n_c. The scribing speed and the power of the laser are held constant. A mechanical stop below the bending edge guarantees that the bending angle does not exceed 90°. An illustration of the laser glass bending process is shown in Fig. 1. The goal of the glass bending process considered here is to obtain bent glass parts with a bending angle as close as possible to 90°. In order to achieve this goal, a sufficient amount of heat has to be induced. Thus, the modelled quantity is the bending angle y := β as a function of the two process variables

x = (x_1, x_2) := (T_f, n_c) ∈ X, (3.1)

where X ⊂ R^2 is the rectangular set with the bounds specified in Tab. 1. Since generating experimental training data from the real process is cumbersome, a two-dimensional finite-element model was set up to generate data numerically. The simulation of the process is based on a coupled thermo-mechanical problem with finite deformation. Since the CO2 laser used in the process operates in the opaque wavelength spectrum of glass, the heat supply is modelled as a surface flux into the deforming sheet.
In this two-dimensional setting, the heat is assumed to be deposited instantaneously along the thickness direction and also instantaneously on all three laser lines. Radiation effects are ignored here, and heat conduction inside the glass is described by the classical Fourier law with the heat conductivity obtained experimentally via laser flash analysis. In view of the relevant relaxation and process time scales for the applied temperature range, the mechanical behaviour of the glass is described by a simple Maxwell-type visco-elastic law. The deformation due to gravity is heavily affected by the pronounced temperature dependence of the viscosity above the glass transition, which is described here using the Williams-Landel-Ferry approximation [56]. The simulation is conducted using the commercial finite-element code Abaqus. It was used to create a training data set comprising 25 data points sampled on a 2D rectangular grid. The values used for the two degrees of freedom (five for T_f and five for n_c) are placed equidistantly from the lower to the upper bounds given in Tab. 1.
Within these ranges and for the laser configuration described above, process experts expect the following monotonicity behaviour: the bending angle y = β increases monotonically with increasing glass temperature in the critical region and thus with increasing T_f and n_c. In other words, the monotonicity signature σ of the bending angle y as a function of the inputs x from (3.1) is expected to be

σ = (σ_1, σ_2) = (1, 1). (3.2)

Forming and press hardening of sheet metal
Another application example is press hardening [33]. Within the experimental setup of this process, a blank is placed in a chamber furnace with a furnace temperature T_f above 900 °C. After heating the blank, an industrial robot transports it with handling time t_h into the cooled forming tool. In the following, the extra handling time Δt_h = t_h − 10 s is used instead, with 10 s being the minimum time the robot takes to move the blank from the furnace to the press. The final combined forming and quenching step allows for the variation of the press force F_p and the quenching time t_q. Afterwards, the formed part is transferred by the industrial robot to a deposition table for further cooling. An illustration of the process chain is shown in Fig. 2. Figure 2: Side view of the press hardening process [33] indicating the considered process steps. Symbols: T_f - furnace temperature, t_h - handling time, F_p - press force, t_q - quenching time.
The goal of the press hardening process considered in this work is to obtain a formed metal part that is as hard as possible, where the hardness is measured in units of the Vickers hardness number (unit symbol HV). In order to achieve this goal, a sufficiently fast cooling rate during the quenching step is necessary to induce a microstructural phase change in the material, which in turn leads to high hardness values. Thus, the output quantity to be modelled is the hardness y of the formed part (at distinguished measurement points on the surface of the part) as a function of the four process variables

x = (x_1, …, x_4) := (T_f, Δt_h, F_p, t_q) ∈ X, (3.3)

where X ⊂ R^4 is the hypercuboid set with the bounds specified in Tab. 2. As in the case of glass bending, however, experiments for the press hardening process are expensive because they usually require manual adjustments, which tend to be time-consuming. And the local hardness measurements at the chosen measurement points on the surface of the quenched part are time-consuming as well. This is why the training database is rather small. It contains 60 points resulting from a design of experiments with the four process variables T_f, Δt_h, F_p, t_q ranging between the bounds in Tab. 2, along with the corresponding hardness values at six local measurement points (referred to as MP1, …, MP6 in the following).
In order to compensate for this data shortage, expert knowledge is brought into play. An expert for press hardening expects the hardness to decrease monotonically with Δt_h and to increase monotonically with T_f as well as with t_q. In other words, the monotonicity signature σ of the hardness y (at any given measurement point) as a function of the inputs x from (3.3) is expected to be

σ = (σ_1, …, σ_4) = (1, −1, 0, 1). (3.4)

In fact, a press hardening expert expects even a bit more, namely that the hardness grows in a sigmoid-like manner with T_f and that it grows concavely towards saturation with increasing t_q. All these requirements result from qualitative physical considerations and are supported by empirical experience.

Results and discussion
In this section, the semi-infinite monotonic regression method is applied to the industrial processes described in Sec. 3 and compared to other approaches for incorporating monotonicity knowledge, which are known from the literature. The acronym SIAMOR is used for the approach. It stands for semi-infinite optimization using an adaptive discretization scheme for monotonic regression.

Informed machine learning models for laser glass bending
To begin with, the SIAMOR method is validated on a 1D subset of the data for laser glass bending, namely the subset of all data points for which n_c = 50. This means that, out of the 25 data points, five points remain for training. First of all, ordinary unconstrained regression techniques are tried (see Fig. 3a). A polynomial model of degree m = 3 (solid line) and a Gaussian process regressor [37] (GPR, dashed line) do not comply with the monotonicity knowledge at high T_f. A radial basis function (RBF) kernel was used for the GPR. This non-parametric model is always a reasonable choice for simulated data because it accurately reproduces the data themselves if the noise-level parameter is kept small. For all GPR models in this work, that parameter was set to 10^-5. Next, the polynomial model is regularized in a ridge regression (dotted line), where the squared l_2-norm λ·||w||_2^2 with a regularization weight λ is added to the objective function in (2.4). λ = 0.003 is chosen, which is roughly the minimum value necessary to achieve monotonicity. But the resulting model does not predict the data very well. Thus, all three models from Fig. 3a are unsatisfactory. As a next step, the monotonicity requirement w.r.t. T_f is brought to bear, and monotonic regression with the SIAMOR method (m = 5) is used (see Fig. 3b) and compared to the rearrangement [10] and to the monotonic projection [25] of the Gaussian process regressor from Fig. 3a. As mentioned before, both comparative methods are based on a non-monotonic pre-trained reference predictor. This is a fundamental difference to the SIAMOR method, which imposes the monotonicity already in the training phase. The projection was calculated as described in Sec. 2.4 with |G| = 80 grid points. For the rearrangement method, the R package monreg was invoked from Python using the package rpy2. The degree m of the polynomial ansatz (2.1) used in the SIAMOR method is chosen as described in Sec. 2.3.
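The near-interpolating behaviour of such a GPR with a small noise level can be sketched with a few lines of NumPy. This is a hand-rolled posterior mean rather than the sklearn GaussianProcessRegressor actually used in this work, and the data, inputs and length scale ℓ = 0.2 are illustrative:

```python
import numpy as np

def rbf(A, B, ell=0.2):
    """RBF kernel matrix k(a, b) = exp(-||a - b||^2 / (2 ell^2))."""
    d2 = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-0.5 * d2 / ell**2)

Xtr = np.linspace(0.0, 1.0, 5)[:, None]    # toy stand-in for the inputs
ttr = np.array([0.0, 0.2, 0.5, 0.9, 1.0])  # toy stand-in for the outputs
alpha = 1e-5                               # small noise-level parameter
coef = np.linalg.solve(rbf(Xtr, Xtr) + alpha * np.eye(len(ttr)), ttr)

def gpr_mean(Xq):
    """Posterior mean of the GPR at query points Xq."""
    return rbf(Xq, Xtr) @ coef

# with a small noise level, the GPR nearly interpolates the training data
assert np.allclose(gpr_mean(Xtr), ttr, atol=1e-3)
```

Nothing in this construction enforces monotonicity between the training points, which is exactly why the unconstrained GPR in Fig. 3a can violate the expert knowledge at high T_f.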
For the specific case considered here, the curve starts to vary unreasonably (albeit still monotonically) between the data points for m ≥ 6, and therefore m = 5 was chosen. The SIAMOR algorithm was initialized with five equidistant constraint locations in X_0, and it converged in iteration 5 with a total number of 9 constraints. The locations of the constraints are marked in Fig. 3b by the gray vertical lines. The adaptive algorithm automatically places the non-initial constraints in the non-monotonic region at high T_f. In terms of the root-mean-squared error on the training data, the SIAMOR model fits the data best, see Tab. 3. Another advantage of the SIAMOR model is that it is continuously differentiable.

Figure 3: Monotonic regression models for the 1D laser glass bending data, including the SIAMOR polynomial model with degree m = 5 (solid). The projection [25] (dash-dotted) and rearrangement [10] (dotted) methods were fed with the dashed GPR curve as non-monotonic reference predictor.

Table 3: Root-mean-squared deviations (RMSE) of the monotonic regression models from the training data for laser glass bending (1D)

Monotonic regression type    RMSE [°]
projection [25]              1.3822
rearrangement [10]           1.8432
SIAMOR                       1.1598

After these calculations on a 1D subset, the full 2D data set of the considered laser glass bending process with its 25 data points is now used. The results are shown in Fig. 4. Again, part a) of the figure displays an unconstrained Gaussian process regressor for comparison. The RBF kernel contained one length scale parameter per input dimension, and sklearn automatically adjusted these hyperparameters using L-BFGS-B; i.e., the employed length scales maximize the log-likelihood function of the model. Nevertheless, the model is unsatisfactory because it exhibits a bump in the rear right corner of the plot, contradicting the monotonicity knowledge. Fig. 4b shows the 2D monotonic projection of the GPR with the monotonicity requirements (3.2) w.r.t. T_f and n_c. It was calculated according to Sec. 2.4 on a rectangular grid G consisting of 40^2 points (40 values per input dimension). The resulting model looks generally reasonable and, in particular, satisfies the monotonicity specifications, but it exhibits kinks and plateaus. The most conspicuous kink starts at about T_f = 546 °C, n_c = 50 and proceeds towards the front right. The rearrangement method by [11] is not used for comparison here because for small data sets in d > 1, it does not guarantee monotonicity.

Informed machine learning models for forming and press hardening
As in the glass bending case, the SIAMOR method is first validated on a 1D subset of the data for the press hardening process. Namely, only those data points with F_p = 2250 kN, Δt_h = 4 s and t_q = 2 s are considered. These specifications are met by six data points, and these were used to train the models shown in Fig. 5. The data reflect the expected sigmoid-like behaviour mentioned in Sec. 3.2, and this extends to the monotonized models. An unconstrained polynomial with m = 3 was chosen as the reference model to be monotonized for the comparative methods from the literature. Degrees lower than that result in larger deviations from the data, and degrees higher than that result in an overfit. Thus, out of all models of the form (2.1), the hyperparameter choice m = 3 yields the lowest RMSE values for projection and rearrangement. For the monotonic regression with SIAMOR, m = 6 and five equidistant initial constraint locations in X_0 were chosen. The algorithm converged in iteration 8 with a total of 12 monotonicity constraints. In terms of the root-mean-squared error, the SIAMOR model predicts the training data more accurately, as can be seen in Tab. 4. The reason is that the rearrangement- and projection-based models are dragged away from the data by the underlying reference model, especially at high T_f.
After these 1D considerations, the SIAMOR method is now validated on the full 4D data set of the press hardening process. Polynomial models with degree m = 3 are used for unconstrained regression and monotonic projection, and polynomials with m = 6 are used for the SIAMOR method.

Table 4: Root-mean-squared deviations (RMSE) of the monotonic regression models from the training data for press hardening (1D)

Monotonic regression type   RMSE [HV]
projection [25]             5.0893
rearrangement [10]          4.8346
SIAMOR                      3.3583

The resulting models are visualized in the surface plots in Fig. 6. The unconstrained model from Fig. 6a clearly shows non-monotonic predictions w.r.t. ∆t h . Furthermore, the hardness slightly decreases with the furnace temperature at T f close to 930 °C, which is not the behaviour expected by the process expert either. Fig. 6b shows the monotonic projection of the unconstrained model. It was computed according to Sec. 2.4 on a grid G consisting of 40⁴ points. The monotonic projection exhibits the kinks that are characteristic of that method, and it yields an RMSE of 28.84 HV on the entire data set.
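Both the grid-based projection and the feasibility checks operate on model evaluations over such a rectangular grid. A minimal sketch of how monotonicity along a chosen input axis can be verified there via finite differences (the helper `monotone_violations` and the toy 2D function are illustrative, not the press hardening model):

```python
import numpy as np

def monotone_violations(values, axis, sign=+1, tol=1e-9):
    """Count adjacent grid pairs where `values` violates monotonicity
    along `axis` (sign=+1: increasing required, sign=-1: decreasing)."""
    d = sign * np.diff(values, axis=axis)
    return int(np.sum(d < -tol))

# toy 2D grid model: y = x0 + x1**2 is increasing in both inputs on [0, 1]^2
g = np.linspace(0.0, 1.0, 40)
X0, X1 = np.meshgrid(g, g, indexing="ij")
Y = X0 + X1 ** 2
print(monotone_violations(Y, axis=0), monotone_violations(Y, axis=1))  # prints "0 0"
```

The same check extends directly to the 40⁴ grid by evaluating the 4D model on a meshgrid and differencing along each constrained axis.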
With an overall RMSE of 10.14 HV, the model resulting from SIAMOR is more accurate for this application. A corresponding response surface is displayed in Fig. 6c. In keeping with (3.4), monotonicity was required w.r.t. T f (increasing), ∆t h (decreasing) and t q (increasing). As m = 6, there were N m = 210 model parameters, and the discretization X 0 was initialized with a grid using four equidistant values per dimension. The algorithm converged in iteration 246 with 1372 final constraints. A first attempt with only two monotonicity requirements (namely w.r.t. T f and t q ) showed that the final number of iterations decreases when the third monotonicity requirement is added. Thus, the monotonicity requirements in the individual directions reinforce each other numerically within the algorithm, at least for the data used here. Yet, this reduction in the number of iterations is not accompanied by a decrease in total calculation time because more lower-level problems have to be solved when there are more monotonicity directions.
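The stated parameter count can be verified directly: a polynomial of total degree at most m in d variables has C(m + d, d) monomial coefficients. A quick sanity check:

```python
from math import comb

def n_params(m, d):
    # number of monomials of total degree <= m in d variables
    return comb(m + d, d)

print(n_params(6, 4))  # prints "210": N_m for m = 6 in the 4D press hardening case
print(n_params(5, 1))  # prints "6": coefficients of the 1D glass bending model, m = 5
```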
With SIAMOR, monotonicity was achieved in all three input dimensions where it was required. See e.g. Fig. 6c, which is the monotonic counterpart of Fig. 6a. A comparison of Figs. 6a-c clearly shows how incorporating monotonicity expert knowledge helps compensate data shortages. Indeed, taking no monotonicity constraints into account at all (Fig. 6a), one obtains an unexpected hardness minimum w.r.t. ∆t h at ∆t h ≈ 2.5 s and small T f . This also results in unnecessarily low predictions of the monotonic projection for small T f and ∆t h ≲ 1.5 s in Fig. 6b. The SIAMOR model (Fig. 6c), by contrast, predicts more reasonable hardness values in this range without needing additional data because it integrates the available monotonicity knowledge already in the training phase.
For the SIAMOR plots in Fig. 7, ∆t h was set to 0 s. The figure shows that monotonicity is also achieved w.r.t. t q . Without this being explicitly demanded, the hardness y shows the expected concave growth towards saturation w.r.t. t q in Fig. 7a. An additional increase in F p leads to Fig. 7b, where the sign of the second derivative of y w.r.t. the quenching time t q changes along the T f -axis. That is, the model changes its convexity properties in this direction and increases convexly instead of concavely with t q at high T f , ∆t h = 0 s and F p = 2250 kN. This contradicts the process expert's expectations. A possible way out is to measure additional data (e.g. in the rear left corner of Fig. 7b), which, however, is elaborate and costly. Another possible way out is to add the concavity requirement ∂²_{x4} y w (x) ≤ 0 for all x ∈ X w.r.t. the x 4 = t q direction to the monotonicity constraints (2.5) used exclusively so far. In order to solve the resulting constrained regression problem, one can use the same adaptive semi-infinite solution strategy that was already used for the monotonicity constraints alone.
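The adaptive semi-infinite strategy can be sketched in one dimension as follows. This is an illustrative reimplementation under simplifying assumptions, not the authors' code: SciPy's general-purpose SLSQP stands in for the convex-QP upper-level solver, a fine reference grid stands in for the global lower-level solver, and the data are synthetic. A concavity requirement would be handled identically, with the second derivative in place of the first.

```python
import numpy as np
from scipy.optimize import minimize

# synthetic monotone 1D data with noise
rng = np.random.default_rng(1)
x = np.sort(rng.uniform(0.0, 1.0, 12))
t = np.tanh(4.0 * (x - 0.5)) + 0.05 * rng.standard_normal(12)

m = 5                                           # polynomial degree
V = np.vander(x, m + 1, increasing=True)        # design matrix, columns x^0..x^m
dcoef = np.arange(1, m + 1)                     # factors j for d/dx x^j

def deriv(c, xs):
    # y'(x) of the polynomial with coefficients c, evaluated at xs
    P = np.vander(np.atleast_1d(xs), m, increasing=True)   # powers 0..m-1
    return P @ (c[1:] * dcoef)

grid = np.linspace(0.0, 1.0, 2001)              # reference grid (lower level)
X_k = list(np.linspace(0.0, 1.0, 5))            # initial constraint locations
for it in range(30):
    # upper level: least squares s.t. y'(xi) >= 0 for xi in the current X_k
    cons = [{"type": "ineq", "fun": lambda c, xi=xi: deriv(c, xi)[0]}
            for xi in X_k]
    res = minimize(lambda c: np.sum((V @ c - t) ** 2),
                   np.zeros(m + 1), constraints=cons)
    c = res.x
    # lower level: location of the most negative derivative on the grid
    g = deriv(c, grid)
    i = int(np.argmin(g))
    if g[i] >= -1e-6:
        break                                   # feasible on the grid: converged
    X_k.append(grid[i])                         # add the violating location
```

The final coefficient vector `c` defines a polynomial that fits the data in the least-squares sense while being (numerically) nondecreasing on the whole reference grid, mirroring the upper-level/lower-level alternation described above.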

Conclusion and outlook
In this article, a proof of concept is conducted for the method of semi-infinite optimization with an adaptive discretization scheme to solve monotonic regression problems (SIAMOR). The method generates continuously differentiable models, and its use in multiple dimensions is straightforward. Polynomial models were used, but the method is not restricted to this model type, even though it is numerically favourable because polynomial models lead to convex quadratic upper-level problems. The monotonic regression technique is validated by means of two real-world applications from manufacturing. It results in predictions that comply very well with expert knowledge and that compensate the lack of data to a certain extent. At least for the small data sets considered here, the resulting models predict the training data more accurately than models based on the well-known projection or rearrangement methods from the literature.
While the present article is confined to regression under monotonicity constraints, semi-infinite optimization can also be exploited to treat other types of shape constraints, such as concavity constraints. In fact, the shape constraints can be quite arbitrary in principle. This is only one of several directions of potential research on the method opened up by this work. Others are the testing of SIAMOR in combination with different model types, data sets or industrial processes. When using Gaussian process regressors instead of the polynomial models employed here, one can try out and compare various kernel types. Additionally, the SIAMOR method can be extended to locally varying monotonicity requirements (i.e. σ j = σ j (x)).
Another possible direction of future research is to systematically investigate how to speed up the solution of the global lower-level problems. When more complex models or shape constraints are used, this becomes particularly important. The solution of multiple lower-level problems and the final feasibility test on the reference grid can be parallelized to reduce the calculation time, for example. A rigorous investigation of the convergence properties and the asymptotic properties of the SIAMOR method and its possible generalizations is left to future research as well.