Optimisation: Basic Basinhopping

Basinhopping optimisation routines are a family of routines designed to find the global minimum in the presence of local minima. The basic principle behind a basinhopping routine is to randomly perturb the parameters away from the current best set at each iteration and then perform a local minimisation from each new starting point.

The best parameters are updated as the routine progresses if a local minimum is found with a smaller evaluation of the objective function than the current best.

In this post I want to focus on the basinhopping side of the algorithm rather than the local minimisation. I am therefore going to use the steepest descent algorithm I built in the following post to perform local minimisation: https://harrybevins.blogspot.com/2020/04/optimisation-by-descent.html.

There are, however, a large number of other local minimisation routines that could be used instead, including gradient descent or, for example, a Nelder-Mead routine.

A simple basinhopping routine is characterised by the following set of variables:

  • 'iters' - The maximum total number of iterations of the basinhopping routine or, put another way, the maximum number of random starting points allowed.
  • 'maxstepsize' - This is the maximum perturbation allowed when the random parameters are perturbed from the current best set. Effectively, our new parameters are chosen from a hypercube of side length 2*'maxstepsize' centred on the current best parameters at each iteration. The sampling can be done with any chosen probability distribution, e.g. a Gaussian with a mean given by the best parameters and a variable standard deviation that defines how likely we are to pick initial parameters a long way from the best set but still within 'maxstepsize'. Alternatively, we can pick the new initial parameters at each iteration from a uniform distribution. Both options are sketched in the snippet after this list.
The simplest of models would in theory only need these parameters to work. However, 'iters' is usually a large number, so the runtime can be long. This can be reduced if we include another parameter:
  • 'success_iters' - If after this number of iterations the best parameter estimate has not updated then the algorithm stops.
'maxstepsize' is also traditionally set to be large so that we do not exclude regions of the parameter space in which the global minimum may be found. However, as the algorithm progresses our knowledge of where the global minimum is should improve, and so we can introduce another variable:
  • 'interval' - The number of iterations between successive decreases of the maximum step size, 'maxstepsize'.
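As a concrete illustration of the two sampling options mentioned in the 'maxstepsize' bullet, the snippet below is a minimal sketch of how the perturbed starting parameters could be drawn. The function names and the clipping of the Gaussian draw are my own illustrative choices and are not part of the routine built in the rest of this post.

import numpy as np

def perturb_uniform(best_params, maxstepsize):
    # draw each new parameter uniformly from a hypercube of side
    # 2*maxstepsize centred on the current best parameters
    best_params = np.asarray(best_params, dtype=float)
    return best_params + np.random.uniform(-maxstepsize, maxstepsize,
                                           len(best_params))

def perturb_gaussian(best_params, maxstepsize, sigma):
    # draw a Gaussian step centred on the current best parameters and clip
    # it so that no parameter moves by more than maxstepsize
    best_params = np.asarray(best_params, dtype=float)
    step = np.random.normal(0, sigma, len(best_params))
    step = np.clip(step, -maxstepsize, maxstepsize)
    return best_params + step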
I am defining my basinhopping code as a callable class, as I had previously done with my steepest descent algorithm, in order for it to be reusable. I initialise the class like so,

import numpy as np
from steepest_descent import descent

class hopping(object):
    def __init__(self, x, y, model, **kwargs):
        self.x = x
        self.y = y
        self.model = model

        # maximum number of basinhopping iterations
        self.iters = kwargs.pop('iters', 1000)
        # maximum number of iterations for each steepest descent run
        self.descent_iters = kwargs.pop('descent_iters', 1000)
        # starting guess for the parameters
        self.initial_parameters = kwargs.pop('initial_parameters', [1, 1])
        # step size used by the steepest descent routine
        self.stepsize = kwargs.pop('stepsize', 0.5)
        # maximum perturbation applied to the current best parameters
        self.maxstepsize = kwargs.pop('max_stepsize', 5)
        # stop after this many iterations without an improvement
        self.success_iters = kwargs.pop('success_iters', 50)
        # number of iterations between reductions of the maximum step size
        self.interval = kwargs.pop('interval', 10)

        # run the basinhopping routine and store the best fitting parameters
        self.best_params = self.fit()

Here I have imported the numpy library and my previously built descent algorithm to perform the local minimisation. The variables 'descent_iters' and 'stepsize' correspond to the descent algorithm parameters and I have included them here as keyword arguments so that the user of the basinhopping routine can easily alter their values. As with the steepest descent algorithm, 'model' is a function that takes the data x and y along with the parameters and produces an estimate of the y data values.

I am using chi squared, $\chi^2$, as my objective function and begin my 'fit' function by running the descent algorithm on the initial parameter guesses and taking the resultant parameters as my initial best parameters.
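Explicitly, for data points $y_i$ and model predictions $m(x_i)$ evaluated with a given set of parameters, the objective function computed throughout this post is $\chi^2 = \sum_i (y_i - m(x_i))^2$, which is the chi squared statistic under the implicit assumption of unit errors on the data.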

def fit(self):

    # initial local minimisation from the user supplied starting parameters
    res = descent(self.x, self.y, self.model,
        initial_parameters=self.initial_parameters,
        stepsize=self.stepsize, iters=self.descent_iters)
    best_params = res.best_params
    # chi squared evaluation for the initial best fit parameters
    f_best = np.sum((self.y - self.model(self.x, self.y, best_params))**2)

I keep track of both the best parameters, 'best_params', and the best or minimum objective function evaluation with the variable 'f_best'.

The basinhopping routine then sits inside a while loop which remains open as long as the iteration number is less than the maximum allowed number of iterations and the number of iterations since the last change in the best parameters is less than 'success_iters'. I therefore also need to keep track of the number of iterations that have passed and the number without a change in the best parameter estimates. This is done with the variables 'count' and 'count_change', which are initialised with a value of 0.

count = 0
count_change = 0
while count <= self.iters:

I proceed to define my maximum allowed step size in the variable 'step'. This is initially set to the input value of 'maxstepsize' and is then halved every 'interval' iterations.

if count == 0:
    # start from the user supplied maximum step size
    step = self.maxstepsize
if count%self.interval == 0 and count != 0:
    # halve the maximum step size every 'interval' iterations
    step /= 2

The choice to halve the maximum allowed step size is not set in stone and alternatives include exponentially decreasing the maximum step size if the parameter space in which the global minimum sits is large. The rate at which this maximum step size is decreased, here every 'interval' iterations, can also affect the likelihood of finding the global minimum. Too fast and you can rapidly reduce your search space around a local minimum; too slow and you are effectively randomly searching the parameter space with no guarantee of discovering the global minimum.
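As a rough sketch of the exponential alternative mentioned above, the step size update inside the loop could be replaced with something like the following, where 'rate' is a hypothetical decay constant that would need tuning for the problem at hand and is not part of the routine built in this post.

# 'rate' is a hypothetical decay constant; larger values shrink the search
# box around the best parameters more aggressively
rate = 0.1
if count%self.interval == 0 and count != 0:
    step = self.maxstepsize*np.exp(-rate*count)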

The next step is to produce my new initial parameters and calculate their associated evaluation of the objective function after a steepest descent.

# randomly choose the direction of the step for each parameter
r = np.random.randint(0, 2, len(self.initial_parameters))
initial_params = np.empty(len(best_params))
for i in range(len(r)):
    # perturb each best parameter by a uniform random amount up to 'step'
    initial_params[i] = best_params[i] + \
        (-1)**r[i]*np.random.uniform(0, step)
# local minimisation from the new starting point
res = descent(self.x, self.y, self.model,
    initial_parameters=initial_params,
    stepsize=self.stepsize, iters=self.descent_iters)
params = res.best_params
# chi squared for the locally minimised parameters
f = np.sum((self.y - self.model(self.x, self.y, params))**2)

'r' is used to randomly pick the direction in which we step for each parameter, where $(-1)^0 = 1$ and $(-1)^1 = -1$, and the amount by which we step is randomly chosen from a uniform distribution.
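As an aside, the same symmetric perturbation could be drawn in a single call from a uniform distribution spanning both directions. This is just an equivalent sketch, assuming 'best_params' is a numpy array, rather than a change to the code above.

# equivalent perturbation drawn directly from a symmetric uniform distribution
initial_params = np.asarray(best_params) + \
    np.random.uniform(-step, step, len(best_params))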

Finally we check whether the new parameters, 'params', and objective function value, 'f', are better than the previous best fit.

if f < f_best:
    # if the improvement is negligible update the best fit and stop early
    if np.isclose(f, f_best, rtol=1e-5, atol=1e-5):
        f_best = f
        best_params = params
        break
    # otherwise record the new best fit and reset the stagnation counter
    f_best = f
    best_params = params
    count_change = 0
else:
    # no improvement this iteration
    count_change += 1
    if count_change == self.success_iters:
        break
count += 1

If 'f' is less than 'f_best' we replace 'f_best' and 'best_params'. We also reset the variable 'count_change' to 0, recalling that this tracks the number of iterations without an update to the best parameters. If 'f' is worse than 'f_best' we add one to 'count_change'. If 'count_change' then equals 'success_iters' the while loop is broken. Finally, we add one to 'count' which, as mentioned, tracks the number of iterations.

If 'f' is less than 'f_best' but the two agree to a tolerance of $10^{-5}$ then we also break the while loop. This prevents unnecessarily prolonging the routine when we only require a certain degree of accuracy in our parameters. The required accuracy in 'f', and consequently in the parameters, could be made a user defined input with a keyword argument. However, for the purposes of this example I have left it hard coded.
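If that flexibility were wanted, a minimal sketch might look like the following, where 'tol' is a hypothetical keyword argument added to __init__ and then used in place of the hard coded values in 'fit'.

# in __init__, a hypothetical user defined tolerance defaulting to 1e-5
self.tol = kwargs.pop('tol', 1e-5)

# in fit(), replacing the hard coded tolerance
if np.isclose(f, f_best, rtol=self.tol, atol=self.tol):
    f_best = f
    best_params = params
    break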

Once the while loop has exited, 'fit' simply returns the variable 'best_params', which is stored as an attribute of the class for the user.
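For completeness, the end of the 'fit' function, which has not been shown explicitly in the snippets above, just needs to hand the best parameters back so that the assignment in __init__ stores them on the class.

# once the while loop exits, return the best parameters found
return best_params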

I use the same example to test this algorithm that I did with my steepest descent model. Namely, I fit $y=5x + 5x^2$ with a polynomial of the form $y=Ax + Bx^2$. This is shown in the following snippet of code.

import numpy as np
from basinhopping import hopping

x = np.linspace(0, 10, 100)
y = 5*x + 5*x**2

def model(x, y, params):
    # model estimate of the y data given the parameters
    y_fit = params[0] * x + params[1] * x**2
    return y_fit

res = hopping(x, y, model, initial_parameters=[4, 10],
    max_stepsize=10, interval=5, success_iters=20)
best_params = res.best_params

I have chosen the above parameters so that I can illustrate how this algorithm works. In reality 'interval' and 'success_iters' often need to be higher in value. For this specific problem a 'max_stepsize' of 10 is sufficient but again this is problem dependent.

The 'initial_parameters' can also play a deciding role in the value of the other variables in the problem. A good initial guess will often mean that the three other parameters can be set to much smaller values than for a bad initial guess. This is because we would need to explore a smaller region of the parameter space and will be able to find the global minimum easily if we start the algorithm in its vicinity. However, determining a good initial guess is often hard if not impossible without visualising the objective function space.

The graph below shows the basinhopping process for the above problem and settings. It illustrates how the randomly generated initial parameters, triangular markers, are constrained within a progressively smaller box with side length determined by the maximum step size. It also shows how the maximum allowed step size is halved every 'interval' iterations. We see how the best fitting parameter estimate, star shaped marker, is continually updated when the steepest descent algorithm ends, circular marker, at a better position in the parameter space.

The graph below shows how the best objective function evaluation is continually updated for the above problem as well.

With more appropriate settings this algorithm can approximate the true parameters that describe the data well. For initial parameters of [4, 8] the steepest descent algorithm alone returns $\chi^2 = 3351$ whereas the basinhopping routine can achieve $\chi^2 = 0.426$ with a best parameter estimate of [4.95, 5.00] to two decimal places.

The code for this project and the code used for the above plots are available at: https://github.com/htjb/Blog/tree/master/basinhopping

Thanks for reading!

