
Optimisation: Descent Algorithm

A large number of different optimisation routines are used for fitting data sets, and some are more effective than others. A simple descent-based algorithm can be produced using Python, and in this post I aim to describe how this can be done. I will also consider the advantages and disadvantages of this method.

The basic idea behind this algorithm is to step around the parameter space of our model, moving in the direction that best improves the value of our objective function until no further improvement can be made. In this instance I will be using $X^2$ as my objective function, which is given by,

$X^2 = \sum_i(y_i - y_{fitted, i})^2.$
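In code, this objective can be evaluated with a short helper function (a sketch; the name `objective` is my own, not part of the class built below):

```python
import numpy as np

def objective(y, y_fitted):
    # Sum of squared residuals between the data and the fitted model
    return np.sum((y - y_fitted)**2)
```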

The stepping routine can be defined in many different ways. For example, it can be based on the derivative of the fitted model, known as gradient descent, or on a continually updated parameter step with checks in every direction in the parameter space. For this example I will use a fixed parameter step, but the code can easily be modified later to adjust this routine.
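For contrast, a single gradient-descent update for the quadratic model used in this post would look something like the sketch below. The function name `gradient_step` and the learning rate `lr` are my own choices, tuned for the data used here; this is not the method implemented in the rest of the post.

```python
import numpy as np

def gradient_step(params, x, y, lr=1e-7):
    # Residuals of the model y = p0*x + p1*x**2
    residuals = y - (params[0]*x + params[1]*x**2)
    # Analytic gradient of the sum of squared residuals
    grad = np.array([-2*np.sum(residuals*x),
                     -2*np.sum(residuals*x**2)])
    # Move downhill by a small, fixed learning rate
    return params - lr*grad
```

Each call moves the parameters a little way down the local slope of $X^2$, rather than testing a fixed grid of surrounding points.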

I will be attempting to fit the following data using this descent algorithm,

import numpy as np

x = np.linspace(0, 10, 100)
y = 5*x + 5*x**2

For this example I am sticking to a 2-D parameter space. The method is possible in higher dimensions, but the number of possible stepping directions grows dramatically with dimension (there are $3^d - 1$ surrounding points to check in $d$ dimensions), which in turn increases run time.

I want to write this code as a callable class so that I can re-use it in the future with only minor modifications. I begin by defining the class and the relevant variables needed,

class descent(object):
    def __init__(self, x, y, model, **kwargs):
        self.x = x
        self.y = y
        self.model = model

        # optional keyword arguments with sensible defaults
        self.iters = kwargs.pop('iters', 100)
        self.initial_parameters = kwargs.pop('initial_parameters', [1, 1])
        self.stepsize = kwargs.pop('stepsize', 0.5)

        # run the fitting routine on instantiation
        self.best_params, self.path = self.fit()

Here 'model' is a function that returns my fitted y for a given set of parameters. In this instance my model is, for simplicity, $y=P_0 x+P_1 x^2$, and this is coded into a function in the separate script that will eventually call my class:

def model(x, y, params):
    # y is unused here but kept so the signature matches
    # the model(x, y, params) call made inside the class
    y_fit = params[0] * x + params[1] * x**2
    return y_fit

'stepsize' defines the amount by which we wish to move each parameter at each iteration. By using kwargs for 'iters', 'initial_parameters' and 'stepsize' we allow the user to adjust the parameters of the algorithm whilst also being able to set reasonable default values. The definitions of 'iters' and 'initial_parameters' will become apparent as we build the algorithm. In the class definition the line,

self.best_params, self.path = self.fit()

where self.fit() is the function inside our 'descent' class that will perform the fitting routine. We would like this function to return the best-fitting parameters, 'best_params', and, for illustrative plotting purposes, the path that the routine has taken through the parameter space.

We need some initial guess at our best fitting parameters and an associated evaluation of the objective function. Therefore we begin our fitting function with,

def fit(self):

    initial_fit = self.model(self.x, self.y, self.initial_parameters)
    f_old = np.sum((self.y - initial_fit)**2)

We define our initial $X^2$ value as 'f_old' in the hope that we can update this by stepping around the parameter space. We also need to define a 'best_params' variable which can be updated as the algorithm progresses. This 'best_params' variable is the starting point of our path through the parameter space and is just given by our initial parameter guesses.

best_params = self.initial_parameters.copy()  # copy so the user's list is never altered
path = []
path.append(best_params)

Our path will be a list of parameter sets and as such we define it as above. Our algorithm needs to check steps in all possible directions for it to work. It will continue to check in all possible directions as long as it can improve the objective function evaluation. This can be achieved with a while loop,

f_new = 0
count = 0
while f_new < f_old:
    if f_new != 0:
        f_old = f_new
    count += 1

'f_new' will be the best evaluation of the objective function at each iteration, and the program will continue through the while loop as long as this is smaller than the previous best value, 'f_old', which we replace with 'f_new' at each iteration. The count variable is used to make sure we never exceed 'self.iters', defined above, which is the maximum allowed number of iterations. When 'count == self.iters' we break out of the while loop with,

if count == self.iters:
    print('Maximum iterations reached')
    break

which is the final block of code in the loop. To check the value of the objective function in each direction we have to step our parameters in all 8 possible surrounding directions. This is done with the following block of code inside the while loop,

f = []
tested_params = []
# stepping right and up
for i in range(len(best_params)):
    new_params = best_params.copy()
    new_params[i] = best_params[i] + self.stepsize
    fit = self.model(self.x, self.y, new_params)
    f.append(np.sum((self.y - fit)**2))
    tested_params.append(new_params)

# stepping left and down
for i in range(len(best_params)):
    new_params = best_params.copy()
    new_params[i] = best_params[i] - self.stepsize
    fit = self.model(self.x, self.y, new_params)
    f.append(np.sum((self.y - fit)**2))
    tested_params.append(new_params)

# stepping diagonally up to the right
new_params = [p + self.stepsize for p in best_params]
fit = self.model(self.x, self.y, new_params)
f.append(np.sum((self.y - fit)**2))
tested_params.append(new_params)

# stepping diagonally down to the left
new_params = [p - self.stepsize for p in best_params]
fit = self.model(self.x, self.y, new_params)
f.append(np.sum((self.y - fit)**2))
tested_params.append(new_params)

# stepping diagonally up to the left
new_params = best_params.copy()
for i in range(len(new_params)):
    if i % 2:
        new_params[i] = best_params[i] + self.stepsize
    else:
        new_params[i] = best_params[i] - self.stepsize
fit = self.model(self.x, self.y, new_params)
f.append(np.sum((self.y - fit)**2))
tested_params.append(new_params)

# stepping diagonally down to the right
new_params = best_params.copy()
for i in range(len(new_params)):
    if i % 2:
        new_params[i] = best_params[i] - self.stepsize
    else:
        new_params[i] = best_params[i] + self.stepsize
fit = self.model(self.x, self.y, new_params)
f.append(np.sum((self.y - fit)**2))
tested_params.append(new_params)

Each new parameter set is built from the previous best, and we store both the tested parameters and the associated objective function evaluation as we step in each direction. At the end of the while loop, before we check that the total number of iterations hasn't reached the allowed maximum, we have,
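As an aside, the eight directions enumerated above can be generated more compactly with itertools.product. This is an equivalent sketch rather than the code used in the class (the generator name `neighbours` is my own), and it extends naturally to higher dimensions:

```python
import itertools

def neighbours(params, stepsize):
    # Every combination of -stepsize, 0, +stepsize across the parameters,
    # skipping the all-zero combination (the current point itself)
    for deltas in itertools.product((-stepsize, 0, stepsize),
                                    repeat=len(params)):
        if any(deltas):
            yield [p + d for p, d in zip(params, deltas)]
```

For two parameters this yields the same eight candidate steps checked above; in $d$ dimensions it yields $3^d - 1$.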

f_new = min(f)
if f_new < f_old:
    for i in range(len(f)):
        if f[i] == f_new:
            # adopt the first parameter set that achieves the minimum
            best_params = tested_params[i]
            path.append(best_params)
            break

Here we take the minimum of the evaluated objective functions at each iteration and assign it to 'f_new'. Its size, remember, determines whether or not we remain in the while loop. If we are to remain in the while loop, i.e. 'f_new < f_old', we update our best parameter estimate with the parameters associated with 'f_new'.

The class is finished by returning the 'best_params' variable and the 'path' list from the 'fit' function with `return best_params, path`.

The full code for this project, including the code to produce the following plots, is available at https://github.com/htjb/Blog/tree/master/Steepest_descent.

I call the class in a separate program shown below,

import numpy as np
from steepest_descent import descent

x = np.linspace(0, 10, 100)
y = 5*x + 5*x**2

def model(x, y, params):
    y_fit = params[0] * x + params[1] *x**2
    return y_fit

best_params = descent(x, y, model, initial_parameters=[4, 2.5]).best_params
print(best_params)

In this example you can see that I have changed the initial parameters to be [4, 2.5]. I could also change the maximum number of iterations with the keyword 'iters' and the step size with 'stepsize'.

Using the above code with the default step size of 0.5 and maximum of 100 iterations, I return best-fit parameters of [5, 5], which match the parameters used to generate our 'data' and mean we have found the global minimum.

The path taken through the parameter space is shown in the figure below. The colour bar shows the normalised value of the objective function for each parameter combination and the black line shows the path taken by the descent algorithm through this space. The star signifies the minimum value of $X^2$ found by the routine.

This looks like a promising algorithm. However, it can run into problems when there are multiple local minima. A local minimum is a point in the parameter space at which all the surrounding points have larger associated $X^2$ values, but whose $X^2$ value is itself larger than that of the global minimum. A descent algorithm like the one described here, or indeed one that uses the gradient to traverse the space, can easily fall into a local minimum and consequently give a poor fit.

I chose the simple 'data' and fitted a model that exactly described it in order to prevent the presence of any local minima. However, even in this simple case we can identify local minima. If I run the code with initial parameters of [4, 8] we can see this happening in the figure below. Here we return parameters of [1, 5.5], and we can see in the following figure that this is a poor fit to the 'data' defined at the very beginning of this article.

Descent algorithms of any form tend to be sensitive to the initial parameters when local minima are present in the parameter space, and consequently they are rarely used in these instances. Estimates can occasionally improve with finer step sizes and better stepping routines. I will attempt to explore this further in another post, but again these alterations do not prevent the routine finishing in a local minimum.

Alternative optimisation routines include basin hopping and simulated annealing, which allow for a larger exploration of the parameter space or allow the routine to move "up hill" with a given probability and so escape local minima. I hope to post some articles about both of these routines soon!
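The "up hill" moves mentioned here are typically governed by the Metropolis acceptance rule used in simulated annealing. As a hedged sketch (the function name `accept` and its signature are my own, not from any particular library):

```python
import numpy as np

def accept(f_old, f_new, temperature, rng=None):
    # Always accept an improvement; otherwise accept an "up hill" move
    # with probability exp(-(f_new - f_old) / temperature)
    rng = np.random.default_rng() if rng is None else rng
    if f_new < f_old:
        return True
    return bool(rng.random() < np.exp(-(f_new - f_old) / temperature))
```

At high temperature nearly all moves are accepted, allowing broad exploration; as the temperature is lowered the routine behaves increasingly like the pure descent described above.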

Thanks for reading!
