Sorting algorithms are an important aspect of data analysis, and many operations require sorted data sets. There is an abundance of different types of sorting algorithms, such as Quicksort, Merge sort and Selection sort to name a few. I thought I would try and learn how to perform a few of these, and I am starting with the Selection sort as it is one of the simplest and easiest to implement.
The selection sort algorithm is given as follows:
- Find the minimum (or maximum) of an unsorted list.
- Remove the minimum (or maximum) and append it as the first element in the sorted list.
- Repeat until there are no elements left in the unsorted list.
Fairly straightforward, right? The algorithm's simplicity means that it is often favoured over more complex alternatives when efficiency is not an issue. It is not a particularly efficient algorithm, however: the time taken to perform it goes like the square of the number of data points, $N$. In other words it has $\mathcal{O}(N^2)$ time complexity. Don't worry about this too much for now as I will demonstrate and explain it in more detail at the end of the article.
So how do we implement this in Python? Well, the first thing we need to do is devise a way to calculate the minimum value in our unsorted list. Now there are many built-in functions to do this in Python, and a common method is just to use the numpy np.min() function. But let's not turn down the chance to do this explicitly. We can define a function like so,
```python
def minimum(list): # Instead of np.min() just because we can.
    for i in range(len(list)):
        if i == 0:
            min = list[i]
        elif list[i] < min:
            min = list[i]
    return min
```
Here the function takes in a list and begins by declaring the minimum as the first element. It then checks if the next element is less than the current minimum and replaces the current minimum if the condition is evaluated as true. This process repeats until there are no more elements to look through and the minimum value in the list has been found. Nice and simple.
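As a quick sanity check, the function can be exercised on a short list (it is restated here so the snippet runs on its own):

```python
# The minimum() function from above, repeated so this snippet is standalone.
def minimum(list): # Instead of np.min() just because we can.
    for i in range(len(list)):
        if i == 0:
            min = list[i]
        elif list[i] < min:
            min = list[i]
    return min

print(minimum([42, 7, 19, 3, 88]))  # 3
```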
In reality, if I was writing this algorithm for work I would use np.min(), as it means my code would be less likely to contain errors. In fact, if I needed a sorted array I would just use np.sort(), perhaps with some careful decision over which sorting algorithm to use via this function. But there is no fun in that, and it's always a good idea to make sure we know what is happening underneath the surface of standard Python functions.
Now that we have a function to find the minimum of a list, we can perform our sort. This is done with the following 'while' loop,
```python
interim_list = list(unsorted_list)  # copy, so the original list is left intact
sorted_list = []
while len(interim_list) > 0:
    min = minimum(interim_list)
    sorted_list.append(min)
    interim_list.remove(min)
```
Here we find the minimum of our interim list, append it to the sorted list and remove it from the interim list. The interim list is essentially what is left of the unsorted list after each iteration, and the while loop continues until there are no elements left in it.
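Putting the two pieces together gives a complete, runnable sketch. The wrapper name selection_sort is my own for illustration; the body follows the snippets above:

```python
# Self-contained selection sort, combining the minimum() function
# and the while loop described in the article.
def minimum(list): # Instead of np.min() just because we can.
    for i in range(len(list)):
        if i == 0:
            min = list[i]
        elif list[i] < min:
            min = list[i]
    return min

def selection_sort(unsorted_list):
    interim_list = list(unsorted_list)  # work on a copy
    sorted_list = []
    while len(interim_list) > 0:
        min_value = minimum(interim_list)
        sorted_list.append(min_value)
        interim_list.remove(min_value)
    return sorted_list

print(selection_sort([5, 2, 9, 1, 5, 6]))  # [1, 2, 5, 5, 6, 9]
```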
That's it! It's very easy to implement, and we don't even need to use any additional Python packages as the standard functions are sufficient. I have wrapped this into a callable class, and the code can be found on my GitHub via the link at the end of the article. We can see the algorithm working in the figure below, where we are sorting a list of length 100 filled with uniformly distributed random numbers in the range 0 - 100.
The demonstration above has been slowed down; in reality, sorting this list takes approximately 0.0006 seconds. We can look at how the time taken scales with $N$ simply by running multiple sorts on lists of varying length and timing the process. This is illustrated in the figure below, and we see the expected approximate $N^2$ relationship I mentioned at the beginning of the article.
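A rough sketch of that timing experiment might look like the following. The list sizes here are illustrative, and the selection sort is written compactly rather than exactly as above:

```python
import random
import time

def selection_sort(values):
    # Compact selection sort, as described in the article.
    interim, result = list(values), []
    while interim:
        min_value = interim[0]
        for v in interim:
            if v < min_value:
                min_value = v
        result.append(min_value)
        interim.remove(min_value)
    return result

# Time sorts of increasingly long random lists; doubling N should
# roughly quadruple the elapsed time for an O(N^2) algorithm.
for n in [100, 200, 400, 800]:
    data = [random.uniform(0, 100) for _ in range(n)]
    start = time.perf_counter()
    selection_sort(data)
    elapsed = time.perf_counter() - start
    print(f"N = {n:4d}: {elapsed:.6f} s")
```

The exact times will vary from machine to machine, which is why the article's graph is built from many repeated runs.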
$\mathcal{O}(N^2)$ is what's known as 'Big O notation', and all it means is that the time taken to perform the selection sort is proportional to the number of data points squared. So in the above graph the blue line shows the actual recorded times for sorts of lists of varying size, and the orange line shows the proportionality between $N^2$ and the time taken. In mathematical terms the orange line is given by,
$t = \bigg(\frac{N^2}{N_0^2}\bigg) t_0$,
where $t_0$ and $N_0$ refer to some known reference point.
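As a worked example, taking the measurement quoted above ($N_0 = 100$ sorted in $t_0 \approx 0.0006$ s) as the reference point, the formula predicts the time for a list ten times longer:

```python
# Predict the sort time for N = 1000 from the N = 100 reference,
# using t = (N^2 / N0^2) * t0.
N0, t0 = 100, 0.0006   # reference point quoted in the article
N = 1000
t = (N**2 / N0**2) * t0
print(f"{t:.4f} s")    # ~0.06 s: 10x the list length, ~100x the time
```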
I hope to follow up this post with some alternative sorting algorithms and as I learn new algorithms I will add to the above graph for comparison.
Thanks for reading! Enjoy sorting!
The code for this project can be found here: https://github.com/htjb/Blog/tree/master/sorting_algorithms/selection_sort