
Selection Sort With Python

Sorting algorithms are an important aspect of data analysis, and many operations require sorted data sets. There is an abundance of different sorting algorithms, such as Quicksort, Merge sort and Selection sort, to name a few. I thought I would try to learn how to perform a few of these, and I am starting with Selection sort as it is one of the simplest and easiest to implement.

The selection sort algorithm is given as follows:
  1. Find the minimum (or maximum) of the unsorted list.
  2. Remove the minimum (or maximum) from the unsorted list and append it to the sorted list.
  3. Repeat until there are no elements left in the unsorted list (a quick trace is given below).
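For example, sorting [3, 1, 2] goes: the minimum 1 is moved across, leaving sorted list [1] and unsorted list [3, 2]; then 2 is moved, giving [1, 2] and [3]; finally 3 is moved, giving [1, 2, 3] and an empty unsorted list.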
Fairly straightforward, right? The algorithm's simplicity means that it is often favoured over more complex alternatives when efficiency is not an issue. It is not a particularly efficient algorithm, and the time taken to perform it scales like the square of the number of data points, $N$. In other words, it has $\mathcal{O}(N^2)$ time complexity. Don't worry about this too much for now as I will demonstrate and explain it in more detail at the end of the article.

So how do we implement this in Python? Well, the first thing we need to do is devise a way to find the minimum value in our unsorted list. Now there are built-in ways to do this in Python, and a common method is just to use numpy's np.min() function. But let's not turn down the chance to do this explicitly. We can define a function like so,
def minimum(values):
    # Instead of np.min(), just because we can.
    # Start with the first element and update whenever a smaller one appears.
    smallest = values[0]
    for value in values[1:]:
        if value < smallest:
            smallest = value

    return smallest
Here the function takes in a list and begins by declaring the minimum to be the first element. It then checks whether each subsequent element is less than the current minimum and replaces the current minimum if the condition evaluates as true. This process repeats until there are no more elements to look through and the minimum value in the list has been found. Nice and simple.
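As a quick sanity check, on a small, arbitrarily chosen list,

print(minimum([4.0, 2.0, 7.0, 1.0]))  # prints 1.0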

In reality, if I were writing this algorithm for work I would use np.min() as it means my code would be less likely to contain errors. In fact, if I needed a sorted array I would just use np.sort(), perhaps with some careful decision over which underlying algorithm to select via this function. But there is no fun in that, and it's always a good idea to make sure we know what is happening underneath the surface of standard Python functions.
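For reference, that production route looks something like the sketch below, where np.sort()'s kind argument selects the underlying algorithm (the input array here is just for illustration),

import numpy as np

data = np.random.uniform(0, 100, 100)
result = np.sort(data, kind='mergesort')  # kind can be 'quicksort', 'mergesort', 'heapsort' or 'stable'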

Now that we have a function to find the minimum of a list, we can perform our sort. This is done with the following 'while' loop,
interim_list = list(unsorted_list)  # Copy so the original list is not emptied.
sorted_list = []
while len(interim_list) > 0:
    smallest = minimum(interim_list)
    sorted_list.append(smallest)
    interim_list.remove(smallest)

Here we find the minimum of our interim list, append it to the sorted list and remove it from the interim list. The interim list is essentially what is left of the unsorted list after each iteration, and the while loop continues until there are no elements left in it.
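For convenience, the two pieces can be bundled into a single function. This is just a minimal sketch of the same loop (the version on my GitHub wraps it in a class instead),

def selection_sort(unsorted_list):
    # Repeatedly extract the minimum of what remains until nothing is left.
    interim_list = list(unsorted_list)
    sorted_list = []
    while len(interim_list) > 0:
        smallest = minimum(interim_list)
        sorted_list.append(smallest)
        interim_list.remove(smallest)
    return sorted_list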

That's it! It is very easy to implement, and we don't even need any additional Python packages as the standard functions are sufficient. I have wrapped this into a callable class and the code can be found on my GitHub via the link at the end of the article. We can see the algorithm working in the figure below, where we sort a list of length 100 filled with uniformly distributed random numbers in the range 0 - 100.

The demonstration above has been slowed down; in fact, sorting this list takes approximately 0.0006 seconds. We can look at how the time taken scales with $N$ simply by running multiple sorts on lists of varying length and timing the process. This is illustrated in the figure below, where we see the expected approximate $N^2$ relationship I mentioned at the beginning of the article.
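A rough sketch of how such timings can be gathered, using the selection_sort wrapper above (the list lengths are arbitrary and the exact times will depend on your machine),

import random
import time

for n in [100, 200, 400, 800, 1600]:
    data = [random.uniform(0, 100) for _ in range(n)]
    start = time.perf_counter()
    selection_sort(data)
    print(f"N = {n}: {time.perf_counter() - start:.4f} s")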


$\mathcal{O}(N^2)$ is what's known as 'Big O notation', and all it means is that the time taken to perform the selection sort is proportional to the number of data points squared. So in the above graph the blue line shows the actual recorded times for sorts of lists of varying size, and the orange line shows the expected $N^2$ proportionality between $N$ and the time taken. In mathematical terms the orange line is given by,

$t = \bigg(\frac{N^2}{N_0^2}\bigg) t_0$,

where $t_0$ and $N_0$ refer to some known reference point. 
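For example, taking the earlier measurement as the reference point, $N_0 = 100$ and $t_0 \approx 0.0006$ seconds, a list ten times longer ($N = 1000$) should take roughly $(1000^2/100^2) \times 0.0006 = 0.06$ seconds, a hundred times slower.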

I hope to follow up this post with some alternative sorting algorithms, and as I learn new ones I will add them to the above graph for comparison.

Thanks for reading! Enjoy sorting!

