Bayesian Optimization

  1. Introduction
    1.1. The problem
    1.2. Surrogate-based optimization
  2. Bayesian optimization
    2.1. Concepts
    2.2. Surrogate model
    2.3. Acquisition function for unconstrained optimization
      2.3.1. PI
      2.3.2. EI
      2.3.3. LCB
      2.3.4. Discussion
    2.4. Treatment of constraints
    2.5. Miscellaneous topics
      2.5.1. Choice of optimization algorithms
      2.5.2. Relearn rate of the surrogate
  3. Final note

How to solve an optimization problem when the objective function and/or the constraints are expensive/impossible to compute?

Introduction

The problem

The objective function is expensive to compute when it is, for example, the lift-to-drag ratio of an aircraft wing computed using a computational fluid dynamics (CFD) solver, or the nonlinear deformation of a composite structure computed using a finite element (FE) solver. The objective function is impossible to compute when it can only be obtained from experiments, for example.

The optimization problem is written formally as,

$$\begin{aligned} \mathop{\mathrm{arg\,min}}_{\mathbf{d}} &\quad J(\mathbf{d};\mathbf{U}) \\ \mathrm{s.t.} &\quad \mathbf{c}_E(\mathbf{d};\mathbf{U}) = 0 \\ &\quad \mathbf{c}_I(\mathbf{d};\mathbf{U}) \ge 0 \end{aligned}$$

where $\mathbf{d}$ is the vector of design variables, and $\mathbf{c}_E$ and $\mathbf{c}_I$ are the aggregated equality and inequality constraint functions, respectively. The state variables $\mathbf{U}$ are those implicitly involved in the evaluation of the objective and constraint functions. Examples of $\mathbf{U}$ are the flow variables in a CFD solver or the displacement field in a structural FE solver.

In the above cases, in order to solve the optimization problem within a practical time frame, the number of evaluations of the objective function has to be limited in the optimization algorithm. This imposes two limitations on the optimization algorithm: (1) the design space cannot be explored by carrying out numerous direct evaluations of the objective function; (2) the derivative of the objective function w.r.t. the design variables cannot be computed using the finite difference approach. These limitations prevent the direct application of any gradient-free or gradient-based algorithm to the optimization problem.

Surrogate-based optimization

To overcome these limitations, one approach is surrogate-based optimization (SBO). SBO algorithms typically contain two key ingredients: a surrogate model and an acquisition function. The surrogate model is employed to approximate the expensive objective function. Since the surrogate is computationally efficient, it allows for fast evaluation of the approximated objective function as well as its derivative w.r.t. the design variables. The acquisition function is a criterion for selecting points in the design space that are potentially solutions to the optimization problem. The acquisition function is designed to account for both exploration, i.e. sampling from areas of high uncertainty, and exploitation, i.e. sampling from areas likely to contain the minimizer of the objective function.

The general procedure of SBO is as follows,

  1. Step 1: Generate a sample data set $\mathcal{D}=\{(\mathbf{d}_i, J_i, \mathbf{c}_{Ei}, \mathbf{c}_{Ii})\}_{i=1}^{N_s}$ by evaluating the objective and constraint functions $J_i, \mathbf{c}_{Ei}, \mathbf{c}_{Ii}$ at a few sample points $\mathbf{d}_i$ in the design space.
  2. Step 2: Generate surrogates $J^{sur}(\mathbf{d})$, $\mathbf{c}_E^{sur}(\mathbf{d})$, $\mathbf{c}_I^{sur}(\mathbf{d})$ using the sample data set $\mathcal{D}$.
  3. Step 3: Find the design point $\mathbf{d}^*$ by solving an optimization problem that consists of the surrogates and the acquisition function.
  4. Step 4: Evaluate the objective and constraint functions $J^*, \mathbf{c}_E^*, \mathbf{c}_I^*$ at the new design point $\mathbf{d}^*$.
  5. Step 5: Update the surrogate using the sample data set augmented with the new design point, $\mathcal{D}^*=\mathcal{D}\cup \{(\mathbf{d}^*, J^*, \mathbf{c}_E^*, \mathbf{c}_I^*)\}$.
  6. Step 6: Repeat steps 3-5 until the convergence or stopping conditions are reached.
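
As a concrete illustration, below is a minimal sketch of this loop in Python for an unconstrained 1D problem, using scikit-learn's GPR as the surrogate and the classical EI (introduced later) as the acquisition function. The toy `objective` stands in for an expensive solver, and the acquisition maximization is a crude random search; a practical implementation would be more careful on both counts.

```python
# Minimal BO/SBO loop sketch (unconstrained, 1D, for illustration only).
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor

def objective(d):                                   # stand-in for an expensive solver
    return ((d - 0.3) ** 2 * np.sin(5.0 * d)).ravel()

rng = np.random.default_rng(0)
D_x = rng.uniform(0.0, 1.0, size=(5, 1))            # Step 1: initial sample points
D_y = objective(D_x)

gpr = GaussianProcessRegressor(normalize_y=True)
for _ in range(20):                                 # Steps 2-6: the update loop
    gpr.fit(D_x, D_y)                               # Steps 2/5: (re)train surrogate
    cand = rng.uniform(0.0, 1.0, size=(1000, 1))    # Step 3: crude acquisition search
    mu, sigma = gpr.predict(cand, return_std=True)
    z = (D_y.min() - mu) / np.maximum(sigma, 1e-12)
    ei = sigma * (z * norm.cdf(z) + norm.pdf(z))    # classical EI (see Section 2.3.2)
    d_star = cand[np.argmax(ei)].reshape(1, 1)
    D_x = np.vstack([D_x, d_star])                  # Step 4: expensive evaluation,
    D_y = np.append(D_y, objective(d_star))         # then augment the data set
```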

One category of SBO is the one-shot approach, which only contains Steps 1-4. In the one-shot approach, the acquisition function is the surrogate itself. The issue is that the initial sample set does not necessarily represent the distribution of the objective function well over the whole design space. As a result, the optimum found by the algorithm may be far off from the true one. The other category of SBO is the updating approach, which contains all six steps. This is the category where Bayesian optimization lies.

Bayesian optimization

Concepts

Bayesian optimization (BO) is essentially the six-step SBO procedure with a statistical interpretation. Adapted from Wikipedia: the expensive function is treated as a random function with a prior distribution, which captures the beliefs about the behavior of the function. The prior distribution is updated using the data set to form the posterior distribution over the objective function. The posterior distribution, in turn, is used to construct an acquisition function that determines the next design point.

The two precursors of BO are Kushner1964 and Mockus1975. A good review of BO is provided in Brochu2010. The Gaussian process regression (GPR) model is a popular choice for the surrogate model in BO. Some notable GPR-based BO implementations include, but are not limited to,

  1. BayesOpt: code, doc, paper
  2. HIPS/spearmint: code
  3. GPyOpt: code, doc
  4. GPflowOpt: code, doc, paper

Besides GPR, there are BO algorithms using other surrogate models, such as those in SMAC (code, paper). A list of BO implementations is provided here. Finally, note that in the engineering community, the BO algorithm tends to appear under the name of the efficient global optimization (EGO) algorithm Jones1998, Sasena2002.

Next, the key components of BO, the surrogate model and the acquisition function, as well as the treatment of constraints will be discussed.

Surrogate model

In this and the following sections, the discussion will be based on the GPR model. The prediction of the GPR model at a new design point $\mathbf{d}^*$ follows a Gaussian probability distribution,

$$J^* = J^{sur}(\mathbf{d}^*) \sim \mathcal{N}\left(\mu(\mathbf{d}^*), \sigma^2(\mathbf{d}^*)\right), \quad\mathrm{or}\quad P(J^{sur}=J^*) = \frac{1}{\sqrt{2\pi\sigma^2}}\exp\left(-\frac{(J^*-\mu)^2}{2\sigma^2}\right)$$

where $\mu$ is the predicted value of the objective function and $\sigma^2$ is the variance, i.e. the uncertainty of the prediction. The details of GPR models have been discussed in previous articles and will be skipped here.
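
As a minimal illustration, the predictive mean $\mu(\mathbf{d}^*)$ and standard deviation $\sigma(\mathbf{d}^*)$ can be queried from a trained GPR model, e.g. with scikit-learn (the training data here is made up):

```python
# Querying mu(d*) and sigma(d*) from a trained GPR surrogate (sketch).
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor

X = np.array([[0.1], [0.4], [0.8]])    # sampled design points d_i
y = np.array([1.2, 0.3, 0.9])          # corresponding objective values J_i

gpr = GaussianProcessRegressor(normalize_y=True).fit(X, y)
mu, sigma = gpr.predict(np.array([[0.5]]), return_std=True)   # J* ~ N(mu, sigma^2)
```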

Acquisition function for unconstrained optimization

BO was initially developed for unconstrained problems. Surveys of acquisition functions for such problems can be found in, for example, [Sasena2002] and Gelbart2015. There are three basic acquisition functions: (1) Probability of Improvement (PI) [Kushner1964], (2) Expected Improvement (EI) Mockus1994, Lizotte2008, and (3) Lower Confidence Bound (LCB) Cox1992. Besides the basic ones, there are acquisition functions based on Thompson sampling and information theory (entropy). In general, it appears that acquisition functions are evolving towards (1) finding multiple candidate points in one iteration, and (2) taking advantage of parallel computing via techniques like asynchronous evaluation. Nonetheless, this article will focus only on the three basic acquisition functions.

PI

Defined as the probability that the new design point $\mathbf{d}^*$ offers a better value of the objective function $J^*$ than the minimum objective function in the sample data set, $J_{\min}=\min\{J_i\}$,

$$C_{PI}(\mathbf{d}^*) = P(J^*\le J_{\min})$$
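
For a GPR surrogate, $J^*$ is Gaussian, so PI has the closed form $C_{PI} = \Phi\left((J_{\min}-\mu)/\sigma\right)$, with $\Phi$ the standard normal CDF. A one-line sketch (the function name is mine; returning 0 at $\sigma=0$ is a convention):

```python
from scipy.stats import norm

def probability_of_improvement(mu, sigma, j_min):
    """Closed-form PI for a Gaussian posterior: Phi((J_min - mu) / sigma)."""
    return norm.cdf((j_min - mu) / sigma) if sigma > 0 else 0.0
```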

EI

Defined as the expectation of the improvement in the objective function at the new design point. In the literature, EI has been generalized to include user-specified parameters that control the exploitation-exploration trade-off. The generalized EI is written as,

$$C_{EI}(\mathbf{d}^*) = \mathbb{E}[I(\mathbf{d}^*)],\quad I(\mathbf{d}) = \max\left(0, (J_{\min}-J^{sur}(\mathbf{d})-\zeta\sigma(\mathbf{d}))^g\right)$$

where $\zeta\ge 0$ and $g\ge 1$ are the user-specified parameters. The classical form of EI is obtained with $\zeta=0$ and $g=1$. A larger $g$ or $\zeta$ puts more weight on exploration. Note that when $\zeta=0$ and $g=0$, EI reduces to PI. For a GPR model, a closed-form expression is available for $C_{EI}$. For the case $g=1$,

$$\begin{aligned} C_{EI}(\mathbf{d}^*) &= \int_0^\infty I\, P(J^{sur}=J_{\min}-\zeta\sigma-I)\, dI \\ &= \left\{ \begin{array}{ll} \sigma [z\Phi(z) + \phi(z)], &\quad \sigma > 0 \\ 0, &\quad \sigma = 0 \end{array} \right. \end{aligned}$$

where $z=\frac{J_{\min}-\zeta\sigma-\mu}{\sigma}$, and $\Phi$ and $\phi$ are the cumulative distribution and probability density functions of the standard normal distribution, respectively. The gradient of $C_{EI}$ w.r.t. $\mathbf{d}$ for $\sigma>0$ is,

$$\frac{\partial C_{EI}}{\partial \mathbf{d}} = -\frac{\partial \mu}{\partial \mathbf{d}}\Phi(z) + \frac{\partial \sigma}{\partial \mathbf{d}}\left[\phi(z)-\zeta\Phi(z)\right]$$

where the computation of $\frac{\partial \mu}{\partial \mathbf{d}}$ and $\frac{\partial \sigma}{\partial \mathbf{d}}$ has been discussed in a previous article.

For the closed form with general $g$ and $\zeta=0$, refer to p. 64 of [Sasena2002].
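
A sketch of the $g=1$ closed form above, vectorized over candidate points (the function name is mine):

```python
import numpy as np
from scipy.stats import norm

def expected_improvement(mu, sigma, j_min, zeta=0.0):
    """Generalized EI for g = 1: sigma * (z * Phi(z) + phi(z)), 0 where sigma = 0."""
    mu, sigma = np.asarray(mu, float), np.asarray(sigma, float)
    z = (j_min - zeta * sigma - mu) / np.where(sigma > 0.0, sigma, 1.0)
    return np.where(sigma > 0.0, sigma * (z * norm.cdf(z) + norm.pdf(z)), 0.0)
```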

LCB

Defined using the LCB concept,

$$C_{LCB}(\mathbf{d}^*) = \mu(\mathbf{d}^*) - \zeta\sigma(\mathbf{d}^*)$$

where the probability of $J^{sur} < C_{LCB}$ is a constant controlled by the user-specified parameter $\zeta>0$. A larger $\zeta$ puts more weight on exploration. The gradient of $C_{LCB}$ w.r.t. $\mathbf{d}$ is straightforward.
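
The corresponding sketch, with the caveat that candidate points are chosen by minimizing $C_{LCB}$ (the default $\zeta=2$ is my arbitrary choice):

```python
def lower_confidence_bound(mu, sigma, zeta=2.0):
    """LCB acquisition; the next design point minimizes this value."""
    return mu - zeta * sigma
```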

Discussion

The PI acquisition function is purely exploitative, which is undesirable for global optimization. The EI and LCB acquisition functions favor candidate points where the predicted objective approaches the current optimum or where the prediction uncertainty is high. Therefore, both $C_{EI}$ and $C_{LCB}$ achieve a balance between exploitation and exploration.

The exploitation-exploration trade-off of the EI and LCB functions can be further tuned by a cooling scheme: the optimization starts with a large user-specified parameter for more exploration and gradually decreases the parameter to focus on exploitation. However, the benefit of this scheme is debated; see [Sasena2002] and [Brochu2010].

Treatment of constraints

The acquisition functions discussed in the previous section work well for optimization problems without constraints or with box constraints only. For more complex constraints, there are two types of treatments. The first type is the direct approach. As suggested in [Sasena2002], the objective and constraint functions are replaced with the surrogates, and the following problem is solved,

$$\begin{aligned} \mathop{\mathrm{arg\,min}}_{\mathbf{d}} &\quad C(J^{sur}, \mathbf{d}) \\ \mathrm{s.t.} &\quad \mathbf{c}_E^{sur}(\mathbf{d}) = 0 \\ &\quad \mathbf{c}_I^{sur}(\mathbf{d}) \ge 0 \end{aligned}$$

Similar to [Sasena2002] is the expected violation (EV) method Audet2000: the design space is filled with candidate points using Latin hypercube sampling, and infeasible candidates are filtered out using the so-called EV function, which takes the same form as EI. Essentially, the constraint surrogates $\mathbf{c}^{sur}$ are replaced by $C_{EV}(\mathbf{c}^{sur}, \mathbf{d})$.
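
A sketch of the direct approach using scipy.optimize.minimize with SLSQP, where the three placeholder functions stand in for trained surrogates (or, for the objective, an acquisition function built on $J^{sur}$); all names and functional forms here are made up for illustration:

```python
# Direct approach sketch: optimize over the surrogates with SLSQP.
import numpy as np
from scipy.optimize import minimize

def j_sur(d):       # placeholder surrogate (or acquisition) for the objective
    return (d[0] - 0.5) ** 2 + (d[1] - 0.2) ** 2

def c_eq_sur(d):    # placeholder surrogate equality constraint, c_E(d) = 0
    return d[0] + d[1] - 1.0

def c_ineq_sur(d):  # placeholder surrogate inequality constraint, c_I(d) >= 0
    return d[0] - 0.1

res = minimize(j_sur, x0=np.array([0.5, 0.5]), method="SLSQP",
               constraints=[{"type": "eq", "fun": c_eq_sur},
                            {"type": "ineq", "fun": c_ineq_sur}])
d_star = res.x      # candidate for the next expensive evaluation
```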

The other type is the indirect approach, which is further divided into two categories, as discussed in [Gelbart2015]. In the first category, the constrained problem is converted to an unconstrained one using classical (non-BO) methods. Three representative approaches are,

  1. Barrier methods: The iterates are prevented from leaving the feasible region by augmenting the objective function with a barrier function, which causes the objective to grow to infinity at the boundary of the feasible region.
  2. Penalty methods: The objective function is augmented with a penalty term, which still allows the iterates to leave the feasible region, but the iterates become feasible as the penalty increases to infinity over the course of the optimization (see the sketch after this list).
  3. Augmented Lagrangian methods: The problem is solved via the Lagrangian form augmented with the extra penalty term from penalty methods, while the Lagrange multipliers are controlled by the penalty parameters.
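
As a sketch of the penalty idea, the quadratic penalty form below and the weight `rho` (which would be increased over the optimization) are standard choices rather than anything prescribed by the cited works:

```python
import numpy as np

def penalized_objective(d, j_sur, c_eq_sur, c_ineq_sur, rho=100.0):
    """Quadratic penalty: J + rho * (||c_E||^2 + ||min(0, c_I)||^2)."""
    viol_eq = np.sum(np.asarray(c_eq_sur(d)) ** 2)
    viol_ineq = np.sum(np.minimum(0.0, np.asarray(c_ineq_sur(d))) ** 2)
    return j_sur(d) + rho * (viol_eq + viol_ineq)
```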

In the second category of the indirect approach, the acquisition function is modified to incorporate the constraint surrogates. Two representative approaches are,

  1. Modified EI methods: An example is the EI with constraints (EIC) Schonlau1998. The EI for the objective function is augmented to account for the constraints. In EIC, the EI is multiplied by the probability that the constraints are satisfied, so that the improvement (almost) only occurs at feasible candidate points (a sketch follows this list).
  2. Marginalization integral (MI) methods: Examples are the integrated expected conditional improvement Gramacy2011 and the expected volume minimization Picheny2014. The acquisition function at $\mathbf{d}$ is defined using an integral over the whole design space,

$$C_{MI}(\mathbf{d}) = \int F(\mathbf{d},\mathbf{d}')h(\mathbf{d}')d\mathbf{d}'$$

where $F(\mathbf{d},\mathbf{d}')$ is a measure of the improvement at $\mathbf{d}'$ given an observation at $\mathbf{d}$, and $h(\mathbf{d}')$ is the probability that the constraints are satisfied at $\mathbf{d}'$. The maximizer of $C_{MI}$ provides the best overall improvement in the design space. The inclusion of $h(\mathbf{d}')$ encourages picking candidate points likely to be feasible, while still permitting the selection of points in the infeasible region that may provide improvement over the whole design space.
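
For EIC with independent GPR constraint surrogates $c_i \sim \mathcal{N}(\mu_{c_i}, \sigma_{c_i}^2)$, the probability that $c_i \ge 0$ holds is $\Phi(\mu_{c_i}/\sigma_{c_i})$, so a minimal sketch is (names are mine):

```python
import numpy as np
from scipy.stats import norm

def ei_with_constraints(ei, mu_c, sigma_c):
    """EIC: scale EI by the probability that every constraint c_i >= 0 holds."""
    prob_feasible = np.prod(norm.cdf(np.asarray(mu_c) / np.asarray(sigma_c)))
    return ei * prob_feasible
```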

In terms of programming, the direct approach is probably the easiest to implement. In particular, the non-BO indirect approaches may be applied to solve the subproblems derived from the direct approach, if standard packages for numerical optimization are employed. The challenge in the MI methods is the integration over the design space, which could be intractable in higher dimensions.

Miscellaneous topics

Choice of optimization algorithms

No matter which algorithm is used, the BO process eventually boils down to a series of non-convex optimization subproblems. It is hard to find the global optimum of a non-convex function. One choice is to use evolutionary algorithms, or so-called global algorithms like DIRECT, which are claimed to be able to find the global optimum given sufficiently many iterations. Another choice is to apply algorithms for convex optimization, especially the gradient-based ones, with multiple restarts.

I personally prefer the latter. Neither choice guarantees convergence to the global optimum, and the convergence rate of the former is much slower than that of the latter, especially in high-dimensional spaces. In practice, multiple restarts usually result in a good sub-optimal point that is sufficient for engineering purposes. A merit of the former, though, is that some of those algorithms can handle disjoint feasible regions.
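
A minimal multistart pattern along these lines, assuming a generic acquisition callable and box bounds:

```python
import numpy as np
from scipy.optimize import minimize

def multistart_minimize(acq, bounds, n_restarts=10, seed=0):
    """Run L-BFGS-B from random starting points and keep the best result."""
    rng = np.random.default_rng(seed)
    lo, hi = np.array(bounds, float).T
    best = None
    for _ in range(n_restarts):
        res = minimize(acq, rng.uniform(lo, hi), method="L-BFGS-B", bounds=bounds)
        if best is None or res.fun < best.fun:
            best = res
    return best
```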

Relearn rate of the surrogate

The training of the surrogates could become expensive as the number of samples increases. Also, contrary to intuition, updating the hyperparameters of the surrogate at every iteration may be unnecessary or even counterproductive Bull2011. A better strategy is to introduce a relearn rate: update the sample data set in the surrogate at every iteration, but update the hyperparameters only every few iterations.
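
A sketch of this schedule with scikit-learn's GPR, where fixing the kernel (via `optimizer=None`) skips the hyperparameter relearn; the acquisition step is replaced by a random pick purely for brevity:

```python
# Relearn-rate sketch: refit hyperparameters every `relearn_rate` iterations,
# otherwise refit only the data while keeping the learned kernel fixed.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor

def objective(d):                                  # placeholder expensive function
    return np.sin(5.0 * d).ravel()

rng = np.random.default_rng(0)
X = rng.uniform(0.0, 1.0, size=(5, 1))
y = objective(X)

relearn_rate, kernel = 5, None
for it in range(20):
    if kernel is None or it % relearn_rate == 0:
        gpr = GaussianProcessRegressor().fit(X, y)              # full relearn
        kernel = gpr.kernel_                                    # keep fitted kernel
    else:
        gpr = GaussianProcessRegressor(kernel=kernel, optimizer=None).fit(X, y)
    d_new = rng.uniform(0.0, 1.0, size=(1, 1))     # stand-in for the acquisition step
    X, y = np.vstack([X, d_new]), np.append(y, objective(d_new))
```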

In practice, the effectiveness of this strategy depends on the quality of the surrogate implementation. If the current hyperparameters are far from the converged values, keeping them would reduce the convergence rate of the whole BO algorithm. As for the GPR model, the update of the sample data set has been discussed in a previous article.

Final note

To be fair, it is worth mentioning the existence of adjoint-based optimization. This methodology applies to optimization problems constrained by partial differential equations (PDEs). One can incorporate the adjoint capability in a PDE solver and enable direct gradient calculation of the expensive objective and constraint functions. This approach thus applies to optimization problems involving CFD and FE solvers. The adjoint method would be a different (but interesting) story that I will discuss in the future.