
Selection Sort With Python

Sorting algorithms are an important aspect of data analysis, and many operations require sorted data sets. There is an abundance of different sorting algorithms, such as Quicksort, Merge sort and Selection sort to name a few. I thought I would try to learn how to implement a few of these, starting with the Selection sort as it is one of the simplest and easiest to implement.

The selection sort algorithm is given as follows:
  1. Find the minimum (or maximum) of an unsorted list.
  2. Remove the minimum (or maximum) from the unsorted list and append it to the end of the sorted list.
  3. Repeat until there are no elements left in the unsorted list.
Fairly straightforward, right? The algorithm's simplicity means that it is often favoured over more complex alternatives when efficiency is not an issue. It is not a particularly efficient algorithm: the time taken to perform it grows like the square of the number of data points, $N$. In other words it has $\mathcal{O}(N^2)$ time complexity. Don't worry about this too much for now, as I will demonstrate and explain it in more detail at the end of the article.

So how do we implement this in Python? Well, the first thing we need to do is devise a way to find the minimum value in our unsorted list. There are built-in ways to do this in Python, and a common approach is to use the numpy np.min() function. But let's not turn down the chance to do it explicitly. We can define a function like so,
def minimum(values):
    # Instead of np.min(), just because we can.
    # Start by declaring the first element as the minimum.
    min_value = values[0]
    for value in values[1:]:
        # Replace the current minimum if a smaller element is found.
        if value < min_value:
            min_value = value

    return min_value
Here the function takes in a list and begins by declaring the minimum to be the first element. It then checks whether the next element is less than the current minimum and, if so, replaces the current minimum. This repeats until there are no more elements to look through and the minimum value in the list has been found. Nice and simple.
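As a quick sanity check we can call the function on a small example list (the values here are just an illustration),
print(minimum([42, 7, 19, 3, 88]))
# 3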

In reality, if I were writing this algorithm for work I would use np.min(), as my code would then be less likely to contain errors. In fact, if I needed a sorted array I would just use np.sort(), perhaps with some careful thought about which algorithm it should use under the hood. But there is no fun in that, and it's always a good idea to make sure we know what is happening underneath the surface of standard Python functions.
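For example, np.sort() lets you pick the underlying algorithm via its kind argument (a quick illustration),
import numpy as np

data = np.random.uniform(0, 100, 100)
sorted_data = np.sort(data, kind='mergesort')  # or 'quicksort', 'heapsort', 'stable'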

Now that we have a function to find the minimum of a list, we can perform our sort. This is done with the following 'while' loop,
# Copy the unsorted list so the original is not emptied as we sort.
interim_list = list(unsorted_list)
sorted_list = []
while len(interim_list) > 0:
    # Find the smallest remaining element, append it to the sorted
    # list and remove it from the interim (unsorted) list.
    min_value = minimum(interim_list)
    sorted_list.append(min_value)
    interim_list.remove(min_value)

Here we find the minimum of our interim list, append it to the sorted list and remove it from the interim list. The interim list is essentially what is left of the unsorted list after each iteration, and the while loop continues until there are no elements left in it.
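For convenience the whole thing can be bundled up; a rough sketch of one way such a wrapper class could look is given below (the name and interface here are illustrative only),
class SelectionSort:
    # Illustrative wrapper; the version on my github may differ.
    def __init__(self, unsorted_list):
        self.unsorted_list = list(unsorted_list)

    def __call__(self):
        # Work on a copy so the original list is preserved.
        interim_list = list(self.unsorted_list)
        sorted_list = []
        while len(interim_list) > 0:
            min_value = minimum(interim_list)
            sorted_list.append(min_value)
            interim_list.remove(min_value)
        return sorted_list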

That's it! It's very easy to implement and we don't even need any additional Python packages, as the standard functions are sufficient. I have wrapped this into a callable class along the lines of the sketch above, and the code can be found on my github via the link at the end of the article. We can see the algorithm working in the figure below, where we are sorting a list of length 100 filled with uniformly distributed random numbers in the range 0 - 100.

The demonstration above has been slowed down; in fact, sorting this list takes approximately 0.0006 seconds. We can look at how the time taken scales with $N$ simply by running multiple sorts on lists of varying length and timing the process. This is illustrated in the figure below, where we see the expected approximate $N^2$ relationship I mentioned at the beginning of the article.
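A rough sketch of that timing experiment, using the illustrative wrapper above and the standard library time module (the list lengths here are arbitrary choices),
import random
import time

lengths = [100, 200, 400, 800, 1600]
times = []
for n in lengths:
    # Fresh random list of length n in the range 0 - 100.
    data = [random.uniform(0, 100) for _ in range(n)]
    sorter = SelectionSort(data)
    start = time.perf_counter()
    sorter()
    times.append(time.perf_counter() - start)
    print(n, times[-1])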


$\mathcal{O}(N^2)$ is what's known as 'Big O notation', and here it means that the time taken to perform the selection sort is proportional to the number of data points squared. So in the above graph the blue line shows the actual recorded times for sorts of lists of varying size, and the orange line shows this proportionality between $N$ and the time taken. In mathematical terms the orange line is given by,

$t = \bigg(\frac{N^2}{N_0^2}\bigg) t_0$,

where $t_0$ is the measured time taken to sort a list of some known reference length $N_0$.
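As a concrete illustration, taking the rough figure quoted above ($t_0 \approx 0.0006$ s for a list of $N_0 = 100$) as the reference point, the scaling law gives a quick prediction for larger lists,
# Reference point quoted earlier: roughly 0.0006 s for a list of length 100.
N0, t0 = 100, 0.0006

def predicted_time(N):
    # t = (N**2 / N0**2) * t0
    return (N ** 2 / N0 ** 2) * t0

print(predicted_time(1000))  # the scaling law predicts roughly 0.06 s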

I hope to follow up this post with some alternative sorting algorithms, and as I learn new ones I will add them to the above graph for comparison.

Thanks for reading! Enjoy sorting!

