There are many different optimisation routines in use for fitting data sets, and some are more effective than others. A simple descent-based algorithm can be written in Python, and in this post I aim to describe how. I will also consider the advantages and disadvantages of this method.
The basic idea behind this algorithm is to step around the parameter space of our model and move in whichever direction best improves the value of our objective function, until no further improvement can be made. In this instance I will use $X^2$, the sum of the squared residuals, as my objective function, which is given by,
$X^2 = \sum_i(y_i - y_{fitted, i})^2.$
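In code, this objective function is simply:

```python
import numpy as np

# the objective function above: the sum of squared residuals
def objective(y, y_fitted):
    return np.sum((y - y_fitted)**2)
```

This is exactly the quantity the fitting class below evaluates inline at each step.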
The stepping routine can be defined in many different ways. For example, it can be based on the derivative of the fitted model, known as gradient descent, or on a continually updated parameter step with checks in every direction of the parameter space. For this example I will use a fixed parameter step, but the code can easily be modified later to adjust this routine.
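As an aside, the derivative-based alternative mentioned above can be sketched in a few lines. This is only a sketch, not the routine built in this post: the data and model match the toy example used later, but the learning rate 'lr' is an illustrative choice of my own.

```python
import numpy as np

# toy data matching the example later in this post
x = np.linspace(0, 10, 100)
y = 5*x + 5*x**2

params = np.array([1.0, 1.0])
lr = 1e-7  # kept small because the x**2 terms make the gradients large
f_start = np.sum((y - (params[0]*x + params[1]*x**2))**2)
for _ in range(5000):
    residual = y - (params[0]*x + params[1]*x**2)
    # analytic gradient of X^2 with respect to P_0 and P_1
    grad = np.array([-2*np.sum(residual*x), -2*np.sum(residual*x**2)])
    params -= lr*grad
f_end = np.sum((y - (params[0]*x + params[1]*x**2))**2)
```

Each iteration moves the parameters a small distance against the gradient of $X^2$, so the objective decreases steadily, although convergence on this ill-conditioned surface is slow without a cleverer step-size rule.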
I will be attempting to fit the following data using this descent algorithm,
import numpy as np

x = np.linspace(0, 10, 100)
y = 5*x + 5*x**2
For this example I am sticking to a 2-D parameter space. This method is possible in higher dimensions but the possible stepping directions increase dramatically with increasing dimensions. In turn this increases run time.
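To put a number on that: a fixed-step search that checks every surrounding point of the current position, as this post's code does in 2-D, needs $3^d - 1$ evaluations per iteration in $d$ dimensions.

```python
# number of candidate steps in a full fixed-step neighbourhood search:
# each of the d parameters can move by -step, 0 or +step, minus the
# all-zero "stay put" combination
def n_directions(d):
    return 3**d - 1

for d in range(1, 6):
    print(d, n_directions(d))  # prints 2, 8, 26, 80, 242
```

In 2-D this gives the 8 directions checked below; by 5-D it is already 242 evaluations per iteration.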
I want to write this code as a callable class so that I can re-use it in the future with only minor modifications. I begin by defining the class and the relevant variables needed,
class descent(object):

    def __init__(self, x, y, model, **kwargs):
        self.x = x
        self.y = y
        self.model = model
        self.iters = kwargs.pop('iters', 100)
        self.initial_parameters = kwargs.pop('initial_parameters', [1, 1])
        self.stepsize = kwargs.pop('stepsize', 0.5)
        self.best_params, self.path = self.fit()
Here 'model' is a function that returns my fitted y for a given set of parameters. In this instance my model is, for simplicity, $y=P_0 x+P_1 x^2$, and this is coded up as a function in the separate script that will eventually call my class:
def model(x, y, params):
    y_fit = params[0]*x + params[1]*x**2
    return y_fit
'stepsize' defines the amount by which we wish to move each parameter at each iteration. By using kwargs for 'iters', 'initial_parameters' and 'stepsize' we allow the user to adjust the parameters of the algorithm whilst also being able to set reasonable default values. The definition of 'iters' and 'initial_parameters' will become apparent as we build the algorithm. In the class definition the line,
self.best_params, self.path = self.fit()
where self.fit() refers to the function inside our 'descent' class that will perform the fitting routine. We would like this function to return the best-fitting parameters, 'best_params', and, for illustrative plotting purposes, the path that the routine has taken through the parameter space.
We need some initial guess at our best fitting parameters and an associated evaluation of the objective function. Therefore we begin our fitting function with,
def fit(self):
    initial_fit = self.model(self.x, self.y, self.initial_parameters)
    f_old = np.sum((self.y - initial_fit)**2)
We define our initial $X^2$ value as 'f_old' in the hope that we can update this by stepping around the parameter space. We also need to define a 'best_params' variable which can be updated as the algorithm progresses. This 'best_params' variable is the starting point of our path through the parameter space and is just given by our initial parameter guesses.
    best_params = self.initial_parameters
    path = []
    path.append(best_params)
Our path will be a list of parameter sets and as such we define it as above. Our algorithm needs to check steps in all possible directions for it to work. It will continue to check in all possible directions as long as it can improve the objective function evaluation. This can be achieved with a while loop,
    f_new = 0
    count = 0
    while f_new < f_old:
        if f_new != 0:
            f_old = f_new
        count += 1
'f_new' will be the best evaluation of the objective function at each iteration, and the program will continue through the while loop as long as this is smaller than the previous best value, 'f_old', which we replace with 'f_new' at each iteration. The count variable makes sure we never exceed 'self.iters', defined above as the maximum allowed number of iterations. When 'count == self.iters' we break out of the while loop with,
        if count == self.iters:
            print('Maximum iterations reached')
            break
which is the final block of code in the loop. To check the value of the objective function in each direction we have to step our parameters in all 8 possible surrounding directions. This is done with the following block of code inside the while loop,
        f = []
        tested_params = []

        # stepping right and up
        for i in range(len(self.initial_parameters)):
            new_params = best_params.copy()
            new_params[i] = best_params.copy()[i] + self.stepsize
            fit = self.model(self.x, self.y, new_params)
            f.append(np.sum((self.y - fit)**2))
            tested_params.append(new_params)

        # stepping left and down
        for i in range(len(new_params)):
            new_params = best_params.copy()
            new_params[i] = best_params.copy()[i] - self.stepsize
            fit = self.model(self.x, self.y, new_params)
            f.append(np.sum((self.y - fit)**2))
            tested_params.append(new_params)

        # stepping diagonally up to the right
        new_params = [best_params.copy()[i] + self.stepsize
                      for i in range(len(best_params))]
        fit = self.model(self.x, self.y, new_params)
        f.append(np.sum((self.y - fit)**2))
        tested_params.append(new_params)

        # stepping diagonally down to the left
        new_params = [best_params.copy()[i] - self.stepsize
                      for i in range(len(best_params))]
        fit = self.model(self.x, self.y, new_params)
        f.append(np.sum((self.y - fit)**2))
        tested_params.append(new_params)

        # stepping diagonally up to the left
        new_params = np.empty(len(best_params))
        for i in range(len(new_params)):
            if i % 2:
                new_params[i] = best_params.copy()[i] + self.stepsize
            if i % 2 == 0:
                new_params[i] = best_params.copy()[i] - self.stepsize
        fit = self.model(self.x, self.y, new_params)
        f.append(np.sum((self.y - fit)**2))
        tested_params.append(new_params)

        # stepping diagonally down to the right
        new_params = np.empty(len(best_params))
        for i in range(len(new_params)):
            if i % 2:
                new_params[i] = best_params.copy()[i] - self.stepsize
            if i % 2 == 0:
                new_params[i] = best_params.copy()[i] + self.stepsize
        fit = self.model(self.x, self.y, new_params)
        f.append(np.sum((self.y - fit)**2))
        tested_params.append(new_params)
Each new parameter set is built from the previous best, and we store both the tested parameters and the associated objective function evaluation as we step in each direction. At the end of the while loop, before checking that the total number of iterations hasn't reached the allowed maximum, we have,
        f_new = min(f)
        if f_new < f_old:
            for i in range(len(f)):
                if f[i] == f_new:
                    best_params = tested_params[i]
                    path.append(best_params)
Here we take the minimum of the evaluated objective functions at each iteration and assign it to 'f_new'. Its size, remember, determines whether we remain in the while loop or not. If we are set to remain in the while loop, i.e. 'f_new < f_old', we update our best parameter estimate with the parameters associated with 'f_new'.
The class is finished by returning 'best_params' and the 'path' list from the 'fit' function with 'return best_params, path'.
The full code for this project, including the code to produce the following plots, is available at https://github.com/htjb/Blog/tree/master/Steepest_descent.
I call the class in a separate program shown below,
import numpy as np
from steepest_descent import descent

x = np.linspace(0, 10, 100)
y = 5*x + 5*x**2

def model(x, y, params):
    y_fit = params[0]*x + params[1]*x**2
    return y_fit

best_params = descent(x, y, model, initial_parameters=[4, 2.5]).best_params
print(best_params)

In this example you can see that I have changed the initial parameters to [4, 2.5]. I could also change the maximum number of iterations with the keyword 'iters' and the step size with 'stepsize'.
Using the above code with the default step size of 0.5 and maximum of 100 iterations, I return best-fit parameters of [5, 5], which match the parameters used to generate the 'data' and mean we have found the global minimum.
The path taken through the parameter space is shown in the figure below. The colour bar shows the normalised value of the objective function for each parameter combination and the black line shows the path being taken by the descent algorithm through this space. The star signifies the minimum value of $X^2$ found by the routine.
This looks like a promising algorithm. However, it can run into problems when there are multiple local minima. A local minimum is a point in the parameter space at which all the surrounding points have larger associated $X^2$ values, but whose $X^2$ value is itself larger than that of the global minimum. A descent algorithm like the one described here, or indeed one that uses the gradient to traverse the space, can easily fall into local minima and consequently give poor fits.
I chose simple 'data', and fitted a model that exactly described it, in order to prevent the presence of any true local minima. However, even in this simple case the routine can become stuck at points that act as local minima for the fixed-step search. If I run the code with initial parameters of [4, 8] we can see this happening in the figure below.
Here we return parameters of [1, 5.5], and we can see in the following figure that this is a poor fit to the 'data' defined at the very beginning of this article.
Descent algorithms of any form tend to be sensitive to the initial parameters when local minima are present in the parameter space, and consequently they are not often used in these instances. Estimates can occasionally improve with finer step sizes and better stepping algorithms. I will attempt to explore this further in another post, but again these alterations do not prevent the routine from finishing in local minima.
Alternative optimisation routines include basin-hopping and simulated annealing, which allow for a larger exploration of the parameter space or allow the routine to move "up hill" with a given probability and so escape local minima. I hope to post some articles about both of these routines soon!
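As a taster, SciPy already ships a basin-hopping routine. Here is a minimal sketch applying 'scipy.optimize.basinhopping' to the same $X^2$ surface; note this is SciPy's optimiser, not the class built above.

```python
import numpy as np
from scipy.optimize import basinhopping

x = np.linspace(0, 10, 100)
y = 5*x + 5*x**2

# X^2 as a function of the parameters only, as SciPy expects
def objective(params):
    y_fit = params[0]*x + params[1]*x**2
    return np.sum((y - y_fit)**2)

# start from the awkward [4, 8] initial guess discussed above
result = basinhopping(objective, [4, 8], niter=25)
print(result.x)
```

Because basin-hopping combines random jumps with local minimisation, it recovers the global minimum near [5, 5] even from the starting point that trapped the fixed-step routine.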
Thanks for reading!