Power of human-algorithm collaboration in solving combinatorial optimization problems

Many combinatorial optimization problems are considered intractable to solve exactly or even approximately. An example of such a problem is maximum clique, which -- under standard assumptions in complexity theory -- cannot be solved in sub-exponential time or be approximated efficiently within a polynomial factor. We show that if a polynomial time algorithm can query informative Gaussian priors from an expert $poly(n)$ times, then a class of combinatorial optimization problems can be solved efficiently in expectation up to a multiplicative factor $\epsilon$, where $\epsilon$ is an arbitrary constant. While our proposed methods are merely theoretical, they cast new light on how to approach solving these problems, which have usually been considered intractable.


Introduction
Human-in-the-loop (HITL) AI has gained considerable attention in recent years [1]. HITL AI has emerged to compete with autonomous systems in fields where the intervention of a human user is required to solve certain hard problems that might be intractable for an autonomous AI. HITL AI aims to find solutions to problems by involving a human in the learning process of an AI model. Such interaction can take the form of, for instance, model adjustment, hyperparameter tuning, or data processing. As a new competitor of autonomous AI, human-algorithm collaboration has shown promising results in educational data mining and learning analytics [2]. Other promising fields of human-algorithm collaboration include human-in-the-loop optimization [3] and human-robot interaction [4].
Solving hard combinatorial optimization problems such as maximum satisfiability [5], maximum clique [6], or minimum vertex cover [7] is usually considered intractable. The consensus in computational complexity theory is that these problems cannot have efficient algorithms that always yield correct results [8]. That is, no polynomial time algorithms to solve them exist. This further implies that some problems, such as finding maximum sized cliques or independent sets in graphs, cannot even be approximated both well and efficiently [9] if the widely assumed super-polynomial lower bounds for these problems hold.
Solving such hard problems in general has not gained much attention in human-algorithm collaboration theory, as the main focus has been on AI applications [10] such as learning [11]. In this paper we show that, through human-algorithm collaboration, a combinatorial minimization or maximization problem that has combinatorial complexity $O(2^n)$ (or, in fact, any $O(poly(n)^{poly(n)})$) can be solved efficiently up to an arbitrary multiplicative factor if the algorithm can query Gaussian priors from a human expert during the execution. Combinatorial complexity means the size of the problem's search space. For instance, in maximum clique the combinatorial complexity is $O(2^n)$, but in the travelling salesman problem it is $O(n!)$. Note that any NP optimization problem can be reduced to, say, the clique problem, which has combinatorial complexity $O(2^n)$. We further show that the scope of problems solvable with the proposed method includes even problems with super-exponential search spaces.
Bayesian optimization is a framework, based on Bayesian statistics, for solving hard function optimization problems. Usually, before the actual Bayesian optimization, an expert provides the algorithm with some information about the problem being optimized in the form of a Gaussian prior. The Gaussian prior reflects one's understanding of the shape and smoothness of the objective function. Querying Gaussian priors from human experts is a very common assumption in the Bayesian optimization literature [12]. In the fully Bayesian approach to AI and global optimization, the optimization procedure based on Gaussian processes expects the Gaussian prior $G(\mu, \Sigma)$ as part of the input [13]. That is, the prior is decided by a human expert. What is more, in the fully Bayesian case, the prior is expected to be correct: the function being optimized is a realization of the given Gaussian prior (see Figure 1 for 3 realizations of Wiener processes, for instance). In this paper, we make similar assumptions. We expect that the priors are consistent and informative. These assumptions are quite realistic and very common, because the expert has access to previous function evaluations, previous problem instances, and multiple prior sampling heuristics such as maximum likelihood estimation.
This all being said, we stress that we do not claim to solve any notoriously hard problem in theoretical computer science. Instead we give indications that human-algorithm collaboration may help us solve even problems considered intractable, as the field of human-algorithm collaboration matures. This paper is organized as follows. First we introduce the concept of Bayesian optimization. Second we show how to reduce a class of combinatorial optimization problem instances to a univariate finite domain function, which can then be approximated by the human-algorithm collaboration Bayesian optimization algorithm that we introduce thereafter. Finally we conclude our research and give some future directions.

Bayesian optimization
Global optimization (GO) aims to find the optimal value(s) of functions, called objective functions, over either finite or bounded domains [14]. The optimal values, depending on the context, can be the global maxima or global minima. In general, global optimization is intractable: the number of function evaluations required increases exponentially in the problem dimension and in the domain size [15]. The GO problems for continuous functions and for functions with finite domains are, respectively,

$$\sup_{x \in X} F(x), \quad X \subseteq \mathbb{R}^d \text{ bounded}, \qquad \text{and} \qquad \max_{x \in D} F(x), \quad |D| < \infty.$$

Note that any minimization problem can be reduced to a maximization problem by setting $F := -F$.
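For a finite domain, the maximization problem above can in principle be solved by exhaustive evaluation; the point of the rest of the paper is that this is exponentially expensive. A minimal sketch (the objective here is a toy stand-in, not any function from the paper):

```python
def brute_force_max(F, domain):
    """Exhaustively evaluate F over a finite domain and return the maximizer."""
    best_x, best_v = None, float("-inf")
    for x in domain:
        v = F(x)
        if v > best_v:
            best_x, best_v = x, v
    return best_x, best_v

# Minimization reduces to maximization via F := -F.
F = lambda x: -(x - 3) ** 2          # toy objective, maximum 0 at x = 3
x_star, v_star = brute_force_max(F, range(10))
```

For a domain of size $2^n$ this loop performs $2^n$ evaluations, which is exactly the cost the reduction and BO machinery below try to avoid.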
Bayesian optimization (BO) is a technique used in GO to search for globally optimal values in continuous, combinatorial, and discrete domains [20]. BO has a wide range of applications, and in recent years its usage in optimizing hard black box functions has increased [21]. BO is often used in low data regimes where evaluation of the objective function is costly or otherwise inefficient [22]. These problems include robot navigation and planning [23], tuning the hyperparameters of deep learning models [24], predicting earthquake intensities [25], finding optimal stock values in stock markets [26], and much more [28].
The advantage of BO is that it can optimize any set of black box functions. That is, if one can only access function values and not, say, gradient information, then BO can be very efficient [21]. Also, BO usually does not make any assumptions about the function it optimizes, unlike multiple state-of-the-art optimization algorithms [20]. Those algorithms expect the objective function to be Lipschitz continuous [30], unimodal [31], or to have a certain shape near the global optimizer [32]. Fully Bayesian optimization, however, assumes that the user has some knowledge of or expectation about the shape of the objective function, expressed as a (Gaussian) prior. The main difference between BO and traditional GO is that in BO one is not assumed to provide, say, a differentiable function; one just has to know to some extent what kind of function it is.
Bayesian optimization is mainly based on Gaussian processes [16]. A Gaussian process is a stochastic process such that every finite collection of its random variables has a multivariate normal distribution. A Gaussian process is completely defined by its covariance function $\Sigma$ (or covariance matrix in finite domains) and its mean function $\mu$ [20]. That is, $G(\mu, \Sigma)$.
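Over a finite index set, a Gaussian process is just a multivariate normal distribution, so drawing realizations given $\mu$ and $\Sigma$ is direct. A sketch, using the Wiener covariance $\Sigma_{ij} = \min\{t_i, t_j\}$ discussed later in the paper as an illustrative choice:

```python
import numpy as np

n = 50
t = np.arange(1, n + 1)
# Covariance matrix of a unit-variance Wiener process: Sigma[i, j] = min(t_i, t_j)
Sigma = np.minimum.outer(t, t).astype(float)
mu = np.zeros(n)  # centered Gaussian process

rng = np.random.default_rng(0)
paths = rng.multivariate_normal(mu, Sigma, size=3)  # 3 realizations of G(mu, Sigma)
```

Each row of `paths` is one function sampled from the prior; in the fully Bayesian setting, the objective is assumed to be one such realization.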
In this paper it is assumed that any function $F$ considered is a realization of a Gaussian process $G(\mu, \Sigma)$, and that $\mu$ is some constant, say $0$ for a centered Gaussian process. This is a very common assumption in the BO literature.
In BO, one places a (Gaussian) prior on the unknown and possibly nonconvex function, which reflects one's understanding of the function: whether it is continuous, differentiable, smooth, unimodal, and so forth [20]. It is usually assumed that $\mu$ is $0$, and hence the prior is defined only by its covariance function (or covariance matrix) $\Sigma$. After the prior is set over the unknown function, an algorithm suggests points where the optimal values might lie. Based on these suggestions and function evaluations, the algorithm updates the prior to a posterior and uses the posterior to suggest even better points where the optimum might be located. In many settings, BO finds the global optimum much faster than, say, brute force search [21]. The suggestions and how they are derived from the prior and posterior are defined by an acquisition function [20]. Multiple acquisition functions exist, including the upper confidence bound [34], expected improvement [35], and Thompson sampling [36], to name a few. A visualization of BO with a posterior distribution of functions, sample points, and an acquisition function can be seen in Figure 2.
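The prior-posterior-acquisition loop described above can be sketched for a finite domain. This is a generic toy implementation of BO with an upper confidence bound acquisition, not the UCB2/EI2 algorithms of [16]; the Wiener-style prior and the peaked objective are illustrative assumptions:

```python
import numpy as np

def bo_ucb(F, domain, Sigma, T, beta=2.0, noise=1e-9):
    """Toy Bayesian optimization with a UCB acquisition on a finite domain.
    Sigma is the prior covariance matrix over the domain; the prior mean is 0."""
    idx, ys = [], []
    for _ in range(T):
        if not idx:
            mu_post = np.zeros(len(domain))
            var_post = np.diag(Sigma).copy()
        else:
            # Exact GP posterior conditioned on the observations so far.
            K = Sigma[np.ix_(idx, idx)] + noise * np.eye(len(idx))
            k = Sigma[:, idx]
            Kinv = np.linalg.inv(K)
            mu_post = k @ Kinv @ np.array(ys, dtype=float)
            var_post = np.diag(Sigma) - np.einsum('ij,jk,ik->i', k, Kinv, k)
        ucb = mu_post + beta * np.sqrt(np.maximum(var_post, 0.0))
        i = int(np.argmax(ucb))
        if i in idx:                       # already sampled: take the best unseen point
            unseen = [j for j in range(len(domain)) if j not in idx]
            if not unseen:
                break
            i = unseen[int(np.argmax(ucb[unseen]))]
        idx.append(i)
        ys.append(F(domain[i]))
    return max(ys)

# Toy run: Wiener-style prior over a 16-point domain, objective peaked at x = 10.
t = np.arange(1, 17)
Sigma = np.minimum.outer(t, t).astype(float)
domain = list(range(1, 17))
best = bo_ucb(lambda x: -(x - 10) ** 2, domain, Sigma, T=16)
```

With enough evaluations the loop recovers the maximum; the interesting regime is $T \ll |D|$, where the posterior steers the search.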
Multiple studies have shown that Bayesian optimization converges to the global optimum in reasonable time [20, 21]. [28] showed an exponential convergence rate of Bayesian optimization to the expected global optimum (that is, the expected highest or lowest value of the function, depending on the context) if the function has a certain shape near the global optimum. [17] showed similar results in a more general setting. While it is tempting to praise Bayesian optimization for its ability to find optimal values efficiently, one downside of the approach is that both the locally and globally optimal values found depend on how well the prior of the Gaussian process is defined. If the black box function is a realization of the prior, then BO works surprisingly well. On the other hand, a poor choice of prior can result in the algorithm not converging at all. In the Bayesian optimization literature [20, 21], it is usually assumed that an expert who has domain knowledge in the context of the objective function provides a prior of which the objective function is a realization.
The regret [22], which is used to give an idea of the convergence rate of a BO algorithm, is based on expectations: the difference between the expected global optimum and the expected best value found for the function. The (simple) regret is defined as $r_T = \mathbb{E}[F_{\sup}] - \mathbb{E}[Y]$, where $F_{\sup}$ is the global optimum and $Y$ is the best value found by a BO algorithm. The expectation in the regret means that, on average over all functions, the Bayesian optimization algorithm will return a value that is $r_T$ close to the expected optimum. The expected values are bounded by the prior and posterior distributions of the Gaussian process. In [16] the regret was defined as a multiplicative factor between the expected best value found and the expected global optimum, which is closely related to an approximation ratio in approximation algorithms. In this paper, we complement the previous results of [16] to obtain a (super-)exponential convergence rate for BO. We restrict ourselves to objective functions that can be changed while preserving the relative scale of the function values of the original objective function. We show that any combinatorial optimization problem with combinatorial complexity $O(poly(n)^{poly(n)})$ can be reduced to such a function. Later, we show how to derive the (super-)exponential convergence rate with our human-algorithm collaboration procedure.
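Because the regret is an expectation over realizations of the prior, it can be estimated by Monte Carlo: sample functions from a known prior, run a cheap optimizer for $T$ evaluations, and average the gap. A sketch (random probing stands in for a real BO algorithm, and the Wiener prior is an illustrative assumption):

```python
import numpy as np

rng = np.random.default_rng(1)
n, T, trials = 32, 8, 200
t = np.arange(1, n + 1)
Sigma = np.minimum.outer(t, t).astype(float)      # Wiener prior, mean 0

gaps = []
for _ in range(trials):
    f = rng.multivariate_normal(np.zeros(n), Sigma)   # realization of the prior
    F_sup = f.max()                                   # global optimum of this draw
    Y = f[rng.choice(n, size=T, replace=False)].max() # best of T random probes
    gaps.append(F_sup - Y)

regret = np.mean(gaps)   # Monte Carlo estimate of E[F_sup - Y]
```

Swapping the random probes for an acquisition-driven sampler is exactly what makes the estimated regret shrink faster in $T$.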

Combinatorial problems as univariate finite domain functions
In this section, we show that any problem with a combinatorial complexity of $O(2^n)$ can be reduced to a univariate function with a domain of size $\Omega(2^n)$. The reduction is quite general and can also be applied to problems with combinatorial complexity $O(poly(n)^{poly(n)})$: one just has to consider a different branching factor $m$.
The reduction algorithm (Algorithm 1) assumes that the problem instance can be viewed as a decision tree (see Figure 2), where a left child arc of a decision node implies that the decision variable at the node is assigned 0 and a right child arc implies a 1 assignment (or vice versa). The function that the reduction produces takes a value from a domain $D = [D_0, D_1]$ and uses the value to find an assignment for the combinatorial problem instance. Finally, the function evaluates the combinatorial problem with the assignment derived from the input value.
The algorithm starts by ordering the decision variables randomly. Second, the algorithm creates an objective function that accepts a value from a finite domain $D = [D_0, D_1]$. Based on the value passed to the function, the function repeatedly divides the remaining domain into two equally sized halves and selects the half that contains the value. At each selection of a half, the assignment is updated with the corresponding decision variable value, based on which of the halves was selected. When the domain has been completely split so that no more intervals can be selected, the original combinatorial problem instance is evaluated with the assignment, and the value of the evaluation (the number of vertices in a clique, the number of clauses satisfied, the number of vertices in a vertex cover, and so forth) is scaled with the scaling parameter and returned. The algorithm can be seen in Algorithm 1 and a geometric representation of the combinatorial problem in Figure 3.
If Algorithm 1 is called with a partial assignment (that is, some variables have already been assigned values), then the algorithm starts by randomly ordering the variables that have not yet been assigned. Then the algorithm produces the function for the remaining variables and uses the partial assignment as the basis of the final evaluation.
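The interval-halving decoding described above can be sketched as follows. This is a simplified reading of Algorithm 1, not the paper's exact pseudocode; the clause-counting evaluator and the integer domain $[0, 2^k - 1]$ are illustrative assumptions:

```python
import random

def reduce_to_function(variables, evaluate, scale=1.0, partial=None, seed=None):
    """Sketch of Algorithm 1: turn a combinatorial instance into a univariate
    function on the finite domain [0, 2**k - 1], where k = number of free variables."""
    partial = dict(partial or {})
    free = [v for v in variables if v not in partial]
    rng = random.Random(seed)
    rng.shuffle(free)                      # random ordering of the unassigned variables

    def F(x):
        lo, hi = 0, 2 ** len(free) - 1
        assignment = dict(partial)
        for v in free:                     # halve the remaining domain once per variable
            mid = (lo + hi) // 2
            if x <= mid:
                assignment[v] = 0          # lower half: assign 0
                hi = mid
            else:
                assignment[v] = 1          # upper half: assign 1
                lo = mid + 1
        return scale * evaluate(assignment)

    return F

# Toy instance: count satisfied clauses of (v0 or v1) and (not v0 or v2).
clauses = [[(0, True), (1, True)], [(0, False), (2, True)]]
def evaluate(a):
    return sum(any(a[v] == want for v, want in c) for c in clauses)

F = reduce_to_function([0, 1, 2], evaluate, seed=42)
best = max(F(x) for x in range(2 ** 3))    # exhaustive check of the reduced function
```

Each integer in the domain decodes to a distinct assignment, so maximizing $F$ is exactly maximizing the original instance.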

Analysis of Algorithm 1
In a combinatorial problem whose combinatorial complexity is $O(2^n)$, there exist $2^n$ different variable assignments. Algorithm 1 runs $O(n)$ steps, and at each step the algorithm splits the remaining domain into two halves and selects one half as the new domain. The assignment for the combinatorial problem is updated based on the order of the variables, the depth at which the algorithm currently operates, and whether the value of input $x$ belongs to the first or the second half of the split interval (one half indicates, say, the value 0 and the other the value 1). This way, the algorithm maps each value of input $x \in D$, $D = [D_0, D_1]$ ($[1, 2^n]$, for instance), to a different assignment, which is used to evaluate the combinatorial problem at the end of the algorithm. The scaling parameter is used for scaling the returned value if a new scale is required, in order to produce functions from a specific Gaussian process prior. The scaling factor can also be used to group certain optimization results to the same function values: for example, an assignment satisfying 1 or 2 clauses could yield the function value 1.
Random orderings of the variables of the same combinatorial optimization problem instance always produce the same set of function values, which merely lie at different positions in the function's domain, even if some variables are fixed, as in the partial assignment case.

Human-algorithm collaboration in Bayesian optimization
In this section we build on a recent paper [16] in which the authors bounded the ratio between the expected global optimum and the expected optimum found by their Bayesian optimization (BO) algorithms (UCB2 or EI2) for functions with finite domains. We complement their results to show that for a specific type of problem, their results can be extended to a polynomial time human-algorithm collaboration procedure. We stress that the algorithm is merely theoretical and is required to query Gaussian priors from an external expert multiple times.
The key ingredient of their work that we use is that, instead of the standard simple regret used in Bayesian optimization, they provided a proof for the expected approximation ratio: the multiplicative ratio between the expected best value found for the function and the expected global optimum (the normregret in equation 5). Also, their bound on the regret ratio is tight up to an arbitrary constant. That is, their UCB2 and EI2 are optimal with respect to the worst case bound: the upper bound equals the lower bound for their algorithms up to some constant. Furthermore, their regret ratio is not asymptotic, which is crucial in our analysis. The upper and lower bounds for their algorithms match up to an arbitrary constant depending on $T$ (the constant tends to 0 as $T$ increases), where $T$ is the number of function evaluations and $N$ is the domain size.
As such, their work cannot be extended to a polynomial time algorithm achieving a constant factor approximation, because their bounds are only expectations (as in standard BO). Hence, to obtain the strict ratio provided by their algorithm, one would have to evaluate an arbitrary number of different functions. We overcome this constraint by narrowing the scope to functions that can be changed almost arbitrarily many times while still depending on the same problem instance (see Algorithm 1).
Without loss of generality, we assume, once again, that the combinatorial problems here have combinatorial complexity $O(2^n)$. In Algorithm 1 we showed that any such problem can be reduced to a black-box univariate function with a finite domain. By running Algorithm 1 multiple times for a single instance with different scaling factors, we produce different functions from the same problem instance, because the algorithm always randomly sorts either all or a subset of the variables (line 2 in Algorithm 1). We use this together with McDiarmid's inequality [37] to obtain, with high probability, convergence to the expected values. McDiarmid's inequality applies to a function $f$ of independent random variables $X_1, \dots, X_m$ such that changing the value of a single variable can change the function value by at most $c_i$, which resembles a Lipschitz condition. The inequality states

$$P\left[\,|f(X_1, \dots, X_m) - \mathbb{E}[f(X_1, \dots, X_m)]| \ge t\,\right] \le 2\exp\left(\frac{-2t^2}{\sum_{i=1}^m c_i^2}\right),$$

where $t$ is some constant bounding the difference between the function of the random variables and its expected value. McDiarmid's inequality is a concentration inequality: the value of a function of random variables converges to its expected value exponentially in the number of random variables if a change in one of its arguments changes the function value only by a small amount ($c_i$). In our case, we run our reduction algorithm (Algorithm 1) $S$ times, use either UCB2 or EI2 [16] each time, and finally output the mean of the values found by UCB2 or EI2. By McDiarmid's inequality, the mean value converges exponentially fast to the bounds promised by UCB2 or EI2. We then branch to the regions of the domain where the optimal function values are likely to reside. Hence, our algorithm follows the popular geometric branch and bound framework used in multiple global optimization algorithms [17, 19].
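The averaging step can be illustrated numerically: the mean of $S$ independent bounded randomized runs concentrates around its expectation, and the spread shrinks as $S$ grows. This is a generic concentration sketch with a uniform stand-in outcome, not the UCB2/EI2 bound itself:

```python
import random
import statistics

def randomized_run(rng):
    """Stand-in for one UCB2/EI2 run on a freshly permuted function:
    a bounded random outcome in [0, 1]."""
    return rng.random()

def mean_of_runs(S, seed):
    rng = random.Random(seed)
    return sum(randomized_run(rng) for _ in range(S)) / S

# Empirical spread of the mean for small vs. large S.
small = [mean_of_runs(10, s) for s in range(200)]
large = [mean_of_runs(1000, s) for s in range(200)]
spread_small = statistics.pstdev(small)
spread_large = statistics.pstdev(large)
```

The bounded-differences condition holds here with $c_i = 1/S$, so McDiarmid's inequality predicts exactly this shrinking spread.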
In [17], the search space is shrunk after function evaluations, and the intervals where the global optimizer is not likely to reside are discarded. [17] and [19] cannot be used as such to solve problems with exponentially sized domains, because the constants in their algorithms depend on the domain size and on the shape of the function near the optimizer(s).
Because the prior given to the Gaussian process upper bounds the expected optimum and the expected best value found by UCB2 and EI2, at every expansion and new sample from Algorithm 1 we query a new prior from an external expert. We assume that the priors queried from the expert do not contradict each other and are informative. That is, there are no conflicting choices of, say, hyperparameters among the priors regarding the same fractions of the search space, and the expert can gain insight into the functions.
Querying informative priors from an expert is a realistic assumption, and one usually made in the BO literature, since the expert has access to multiple sampling heuristics, such as maximum likelihood estimation, as well as previous function evaluations and domain knowledge. This differentiates querying priors from an expert from standard oracle queries in the theory of computation, where the oracle is an unrealistic nondeterministic entity that is always correct. Our extension to the algorithm of [16] can be seen in Algorithm 2.
Before going into details we introduce some definitions and an assumption.
The upper bound $ub$ of a cell is derived from $val$, the mean of the values found by EI2 or UCB2 in $S$ runs on the cell, with $T = X$ and $N$ the size of the cell. $ub$ is calculated for each cell as if there were no approximation error.

15: re-sample $a$ and $b$ from Algorithm 1; use the absolute positions of $a$ and $b$ (in the original domain) to derive a partial assignment for Algorithm 1 and $D_0$, $D_1$
16: derive covariance matrices ($\Sigma_i$, $\Sigma_j$) and means ($\mu_i$, $\mu_j$) for $a$ and $b$ using a query to the expert
17: solve UCB2 or EI2 for $a$, using $T := X$ iterations, the derived covariance matrix $\Sigma_i$, and $\mu_i$
18: solve UCB2 or EI2 for $b$, using $T := X$ iterations, the derived covariance matrix $\Sigma_j$, and $\mu_j$
19: $s := s + 1$
20: $i := i + 1$
21: $j := j + 1$
22: end while
23: restore the original order of variables for $a$ from $Vars$
24: restore the original order of variables for $b$ from $Vars$
25: deduce $ub$ of $a$ from its found UCB2 or EI2 values in the $S$ samples from Algorithm 1, the size of $a$, and $X$
26: deduce $ub$ of $b$ from its found UCB2 or EI2 values in the $S$ samples from Algorithm 1, the size of $b$, and $X$
27: $current :=$ argmax over cells (the cell with the highest $ub$)
28: $t := t + 1$
29: end while
30: return the maximum value of any cell found by UCB2 or EI2

Algorithm 2 shows the pseudocode of the algorithm. If the priors given by the expert are informative and consistent, then Algorithm 2 converges to an optimal solution in $O(S \cdot A \cdot K \cdot C \cdot \log_2(D))$ for some $C$ depending on the problem instance, $S$, and $X$, and for $K$ the number of function values close to the global optimum. This is because Algorithm 2 with high probability expands the optimal cell into two equally sized halves, so the rate of convergence is exponential. In the following we prove our claims.

Algorithm 2 can be modified so that it passes all $S$ random permutations from Algorithm 1 to a black box function that optimizes the permutations with EI2 or UCB2 and outputs the mean of the results. Hence, each $c_i$ is bounded by $(b - a)/S$, where $b$ and $a$ are an upper and a lower bound of the function values, respectively. This implies that the mean value of the $S$ UCB2 or EI2 runs converges exponentially fast to its expected value in $m$, the number of random permutations.
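Stripped of the expert queries and the GP machinery, the control flow of Algorithm 2 is a best-first geometric branch and bound over domain cells. A hypothetical skeleton where `probe_cell` stands in for the mean of $S$ UCB2/EI2 runs (the sparse probing and the toy objective are illustrative assumptions):

```python
import heapq

def branch_and_probe(F, lo, hi, probe_cell, max_expansions):
    """Best-first expansion of domain cells: always split the cell with the
    highest upper-bound estimate into two equally sized halves."""
    best = max(F(lo), F(hi))
    # Max-heap via negated keys: (-ub, lo, hi).
    heap = [(-probe_cell(F, lo, hi), lo, hi)]
    for _ in range(max_expansions):
        if not heap:
            break
        neg_ub, a, b = heapq.heappop(heap)
        best = max(best, -neg_ub)          # ub here is an actual probed value
        if b - a <= 1:
            continue                       # cell can no longer be split
        mid = (a + b) // 2
        for c, d in ((a, mid), (mid + 1, b)):
            heapq.heappush(heap, (-probe_cell(F, c, d), c, d))
    return best

def probe_cell(F, a, b, samples=4):
    """Stand-in for the averaged UCB2/EI2 estimate on a cell: a few sparse probes."""
    step = max(1, (b - a) // samples)
    return max(F(x) for x in range(a, b + 1, step))

F = lambda x: -abs(x - 37)                 # toy objective, maximum 0 at x = 37
found = branch_and_probe(F, 0, 63, probe_cell, max_expansions=30)
```

In the paper's setting, each `probe_cell` call is where Algorithm 1 is re-sampled and a fresh prior is queried from the expert; the heap discipline is what makes only the promising cells get expanded.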
The expected approximation ratio in [16] is given with matching upper and lower bounds depending on $T$ and $N$. By McDiarmid's inequality, the mean approximation ratio concentrates between these bounds as $m := S$ grows.
The cell with the highest upper bound is always expanded. This, together with the upper and lower bounds, implies that a cell that does not contain the expected global optimizer might be expanded instead of the cell with the optimizer: some cell might have a lower approximation error and an only slightly smaller expected optimum, and hence a larger $ub$. This effect is limited by the gap between $F_{local}$ and the expected global optimum, where $F_{local}$ is the maximum of a cell other than the one containing the expected global optimum, plus a small constant from McDiarmid's inequality's convergence error. Equation (18) gives a limit for the optimal solution.
This implies that even for a cell without any approximation error, if the value found is further than the factor (18) from the expected global maximum, then the cell will not be expanded, because it will be dominated by at least the cell containing the expected global optimizer.
These prove that, with high probability in $m := S$, only optimal cells will be expanded. Lemma 4.3. Algorithm 2 will, with very high probability, find at least an approximately optimal solution to any optimization problem derived from some Gaussian process and Algorithm 1.
Proof. When Algorithm 2 is run for sufficiently many iterations, the values found by UCB2 and EI2 increase as the cells' sizes decrease [16], and tend to each cell's expected optimum.
The probability $x$ of always selecting the optimal cell is proportional to the concentration given by McDiarmid's inequality in $m := S$, but it decreases exponentially in the depth of the tree and tends to 0: $P[x \mid h] = P[x]^h$, where $h$ is the maximum depth of the tree. On the other hand, because $P[x]$ increases exponentially in $m$ and tends to 1, the expected number of samples from Algorithm 1 required to obtain $P[x \mid h] > 0.5$ is bounded polynomially in $m := O(\log_2(D)) = O(h)$.
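As a quick numerical illustration of this bookkeeping, assuming level-wise successes are independent so the overall probability over $h$ levels is $p^h$: for $p^h$ to exceed $1/2$, the per-level probability must satisfy $p > 2^{-1/h}$, and the shortfall $1 - p$ shrinks only like $\ln(2)/h$, i.e. polynomially in the depth:

```python
import math

def per_level_probability(h, target=0.5):
    """Per-level success probability p with p**h = target,
    assuming independent level-wise successes."""
    return target ** (1.0 / h)

h = 20                         # tree depth, h = log2(D)
p = per_level_probability(h)   # p**20 = 0.5
shortfall = 1.0 - p            # how close to 1 the per-level probability must be
approx = math.log(2) / h       # first-order approximation of the shortfall
```

So deeper trees demand per-level probabilities only polynomially closer to 1, consistent with the polynomial sample bound in the lemma.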
Lemma 4.4. The expected run time of Algorithm 2 to find at least an approximately optimal solution is $O(S \cdot A \cdot K \cdot C \cdot \log_2(D))$ (the time complexity of UCB2 or EI2 is omitted here) if the function has $K$ optimal solutions, where $D$ is the domain size, $A$ is the number of variables in the combinatorial optimization problem, and $C$ depends on the function values and $X$.
Proof. By Lemma 4.3, the algorithm expands the optimal cell with high probability in $S$. Because the expansion procedure always divides a cell into two equally sized halves, and because the optimal cell is expanded with high probability, the size of the region under consideration shrinks as $O(2^{-N})$. Fixing $N = \log_2(D)$, we have $O(S \cdot A \cdot K \cdot C \cdot \log_2(D))$. $C$ depends on when Algorithm 2 is able to distinguish the expected global optimum from similar values of the objective function, which in turn depends on the parameter $X$ for UCB2 or EI2. This completes our proof.
Of course, the approach provided here is merely theoretical. The first assumption is that all functions are indeed realizations of Gaussian processes (which might not be the case), and second, as the domain size increases, more priors (even in the best possible case) have to be provided by the expert, which is not feasible even for moderately sized problems in real life. Another disadvantage is that for some covariance functions, UCB2 or EI2 may produce better results and have less strict bounds, because the bounds are general and not prior dependent. This, of course, leads to situations where, for some priors, Algorithm 1 spends more time searching sub-optimal regions of the search space. A sample run of the search space shrinking can be seen in Figure 4. The approximation error term is very small: for instance, fixing $T := X = 10^7$ yields less than 2% approximation error for large domains. In equation (18) we showed that the approximation error is proportional only to $X$.

Wiener processes revisited
A Wiener process $W$ is a stochastic Gaussian process defined by independent Gaussian increments. For a Wiener process with unit variance, the covariance function of $W_s$ and $W_t$, $t \ge s$, is $\min\{s, t\}$. Because the only hyperparameter of the Wiener process is the variance, which can be normalized to unit variance by the scaling factor in Algorithm 1, selecting Wiener processes as the statistical model in Algorithm 2 requires no expert interference.
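These defining properties can be checked empirically: summing i.i.d. standard normal increments gives a discrete Wiener path, and the sample covariance of $W_s$ and $W_t$ approaches $\min\{s, t\}$ (the concrete times 10 and 30 below are just an illustrative choice):

```python
import numpy as np

rng = np.random.default_rng(7)
n, n_paths = 64, 20000
increments = rng.standard_normal((n_paths, n))  # independent N(0, 1) increments
W = np.cumsum(increments, axis=1)               # W_t = sum of the first t increments

s_idx, t_idx = 9, 29                            # W_10 and W_30 (1-based times)
emp_cov = np.mean(W[:, s_idx] * W[:, t_idx])    # mean is 0, so this estimates E[W_s W_t]
# Theory: Cov(W_10, W_30) = min{10, 30} = 10, so emp_cov should be close to 10.
```

Since no lengthscale or other hyperparameter appears anywhere in this construction, the point in the text follows: only a variance scale needs to be fixed.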
The UCB2 or EI2 algorithms might not offer tight bounds for a (discrete) Wiener process. However, by lower bound theory there must exist an algorithm whose non-asymptotic normregret upper bound matches the normregret lower bound at least up to a constant factor. One such algorithm is pure random search for the standard continuous Wiener process [38]. For a standard Wiener process, one can easily modify Algorithm 1 to reduce a combinatorial optimization problem to a continuous domain function with $X = [1, D]^d$, $d = 1$.
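Pure random search on such a continuous reduced function is straightforward; a sketch (the smooth toy objective stands in for a Wiener-like realization, and the convergence guarantee for Wiener objectives is the content of [38], not of this snippet):

```python
import random

def pure_random_search(F, lo, hi, T, seed=None):
    """Evaluate F at T points drawn uniformly from [lo, hi] and keep the best."""
    rng = random.Random(seed)
    best_x, best_v = None, float("-inf")
    for _ in range(T):
        x = rng.uniform(lo, hi)
        v = F(x)
        if v > best_v:
            best_x, best_v = x, v
    return best_x, best_v

F = lambda x: -(x - 2.5) ** 2            # toy objective, maximum 0 at x = 2.5
x_star, v_star = pure_random_search(F, 1.0, 4.0, T=5000, seed=0)
```

The appeal of this algorithm in the Wiener case is exactly that it needs no prior, covariance, or acquisition machinery at all.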
This implies that by considering Gaussian processes whose only hyperparameter is a scaling factor, one can use Algorithms 1 and 2 for traditional optimization.
In the Wiener process's case, the approximation ratio is based on the distribution of the function values under the Wiener measure.

Conclusions
In this paper we have shown that with a type of human-in-the-loop Bayesian optimization, one can solve any combinatorial optimization problem (either minimization or maximization) in polynomial time, as long as the combinatorial complexity of the problem is $O(2^n)$ and we can assume that the expert can correctly provide $O(n)$ Gaussian process priors. While our proposed algorithm is merely theoretical and has little practical interest, we believe that our approach could be used to gain new insights into how to tackle the hardest NP-hard combinatorial problems in the future.
As is usually expected, most combinatorial optimization problems may be intractable. Our research casts new light on how to avoid the common pitfalls faced while solving these problems: involve human expertise in the optimization process. In our approach, the human expert is not required to know where the optimizers lie in the search space. Instead, the expert is only assumed to provide a Gaussian prior when queried. The prior captures the expert's opinion of how drastically nearby values of the function might change, whether the changes are periodic, and so forth.
The expert knowledge can be supported with previously calculated values of the objective function and, for instance, maximum likelihood estimation [17]. [17] gave bounds on the sample sizes of random variables (in our case, the function values) required to reach a certain probability when optimizing a likelihood function for the covariance hyperparameters. Because the bounds do not depend on the domain size of the function, fixing a sample size and a covariance function for the covariance matrix $\Sigma$ yields a constant time estimation of a prior. This can be combined with the expert knowledge.
Our results indicate that in the future, in order to understand the fundamental limits of our tools (whether a combinatorial optimization algorithm or an AI system), human intervention might be required. The scope of combinatorial problems that can in theory be solved efficiently through our human-algorithm collaboration procedure is enormous. Such problems include protein structure prediction, finding maximum sized cliques, model checking, and automated theorem proving (basically every problem in NPO, and even beyond, to problems with super-exponential combinatorial complexity).