Skip to main content

Binary Search with Python

The Binary Search is an important algorithm for any Data Analyst to know. Frequently, we may need to identify if a particular value can be found in a data set and if it can we may need to know it's index. For example we may have some conditional task that only gets acted upon if our data set features a specific value. 

Perhaps, another example, we have collected some data but we know having plotted it in some informative way that there is a spurious value in the set that isn't following the expected trend. We can most likely attribute it to a typo, 349 instead of 345 say, when entering our data but either way we consider it an outlier and we need to identify it's position in the list and remove it from the data set. A possible solution is to loop through our data set until we find the entry that equals 349 but our data set could be big and this could be very time consuming. The alternative is to use the Binary Search which allows us to find the index of our typo in a sorted lists quickly and easily with far less comparisons.

The algorithm itself is fairly straightforward and easy to implement. The list must be sorted from lowest value at index 0 to highest value at the final index. We then have a target value we wish to find which we call $T$ and the algorithm essentially compares $T$ to the middle value, $m$ of the list. The half of the list in which $T$ cannot be found is then discarded and the process repeated until $m = T$.

More succinctly the algorithm is given as,

  1. Define $L$ to be index 0 and $H$ to be the final index in the list.
  2. Define $m$ to be the floor division, $\frac{L + H}{2}$ such that $m$ is an integer index at the centre of the list.
  3. Compare the values of $m$ and $T$ the target. If $T < m$ then set $H = m -1$ and if $T > m$ set $L = m + 1$.
  4. Repeat from step 2 until $m = T$. If $L > H$ then the search is unsuccessful and $T$ is not to be found in the list.
We can implement this in Python using the following code,
def searching(x, T):
    L = 0
    H = len(x) - 1

    while L <= H:
        m = (L + H)//2
        if x[m] < T:
            L = m + 1
        elif x[m] > T:
            H = m - 1
        else: return m, True
    return None, False
Here $x$ is our sorted list and the function is returning the index, $m$ of the target element, $T$ if it is present in the data set. We also return a boolean argument of 'True' in these instances so that we can use the function for conditional operations that are dependent on the presence of $T$. If $L > H$ then the while loop is exited and we return 'None' for the index and the 'False' as the element $T$ is not present in the list.

We can test that this code works and I do this by generating 100 uniformly distributed random numbers between 0 and 100. I then append my target value of 16 into the list so that I know it is present and sort the list using the Quicksort algorithm I wrote here. I also re-write the above function into a callable class and this can be found on my github linked at the end of the article.

The graph below shows the algorithm in action after passing my sorted list and the target value of 16 to the Binary Search function. The yellow bars show the limiting values, $L$ and $H$ of the search space and the red bar shows the value of $m$. The light shaded blue bars show the elements of the list that are excluded from the search as the algorithm progresses and you can see that the search space is halved at every iteration.
You can perform a Binary Search on any sortable data set. In other words if you can find a consistent way to translate strings into sortable numerical values then a Binary Search can be used to identify if a phrase is present on a webpage for example.

Similarly, modifications can be made to the algorithm to identify if there are any values in the list close to if not equal to the target value, $T$.  For example, we could exit the while loop using something like,
np.isclose(m, T, rtol=rtol, atol=atol),
to identify an element in the array that is within the stated relative and absolute tolerances of $T$. This is useful when the list entries are high precision but we are only interested in whether or not there is a value that approximates the target.

There is a lot more that can be done with this algorithm and there are a number of improvements to the speed of the algorithm that can be made. The algorithm has a $\mathcal{O}(\log(N))$ time complexity which is itself very quick but it can be laborious when the list we are searching has a large number of entries, $N$.

I hope that this brief introduction to the Binary Search has been interesting. I may follow it up with some alternative search algorithms and/or modifications in the future. The code for the project can be found here.

Thanks for reading!! Comments are welcome!

More interesting reading:

Comments

Popular posts from this blog

LDL Decomposition with Python

I recently wrote a post on the Cholesky decomposition of a matrix. You can read more about the Cholesky decomposition here;  https://harrybevins.blogspot.com/2020/04/cholesky-decomposition-and-identity.html . A closely related and more stable decomposition is the LDL decomposition which has the form, $\textbf{Q} = \textbf{LDL*}$, where $\textbf{L}$ is a lower triangular matrix with diagonal entries equal to 1, $\textbf{L*}$ is it's complex conjugate and $\textbf{D}$ is a diagonal matrix. Again an LDL decomposition can be performed using the Scipy or numpy linear algebra packages but it is a far more rewarding experience to write the code. This also often leads to a better understanding of what is happening during this decomposition. The relationship between the two decompositions, Cholesky and LDL, can be expressed like so, $\textbf{Q} = \textbf{LDL*} = \textbf{LD}^{1/2}(\textbf{D})^{1/2}\textbf{*L*} = \textbf{LD}^{1/2}(\textbf{LD}^{1/2})\textbf{*}$. A simple way to calcu...

Subplots with Matplotlib

As already discussed in some of my previous articles good visualisation of data is essential to getting the associated message across. One aspect of this is the need to plot multiple data sets or visualise the same data set in different ways on the same figure. For example we may wish to illustrate our data and the residuals after we subtract a fit to that data all in the same figure. This can be effectively done using matplotlib and the associated subplots environments. Mastery of these tools is something that comes very much with practice and I do not claim to be an expert. However, I have some experience with the environment and I will share with you the basics in this article. In order to use the matplotlib environment we will need to begin by importing matplotlib via, import matplotlib.pyplot as plt We can then proceed to explore what is on offer. plt.subplot() and plt.subplots() So plt.subplots() and plt.subplot() are probably where most people begin to learn about the idea of c...

Random Number Generation: Box-Muller Transform

Knowing how to generate random numbers is a key tool in any data scientists tool box. They appear in multiple different optimisation routines and machine learning processes. However, we often use random number generators built into programming languages without thinking about what is happening below the surface. For example in Python if I want to generate a random number uniformally distributed between 0 and 1 all I need to do is import numpy and use the np.random.uniform() function. Similarly if I want gaussian random numbers to for example simulate random noise in an experiment all I need to do is use np.random.normal(). But what is actually happening when I call these functions? and how do I go about generating random numbers from scratch? This is the first of hopefully a number of blog posts on the subject of random numbers and generating random numbers. There are multiple different methods that can be used in order to do this such as the inverse probability transform method and I ...