The Binary Search is an important algorithm for any Data Analyst to know. Frequently, we may need to identify if a particular value can be found in a data set and if it can we may need to know it's index. For example we may have some conditional task that only gets acted upon if our data set features a specific value.
Perhaps, another example, we have collected some data but we know having plotted it in some informative way that there is a spurious value in the set that isn't following the expected trend. We can most likely attribute it to a typo, 349 instead of 345 say, when entering our data but either way we consider it an outlier and we need to identify it's position in the list and remove it from the data set. A possible solution is to loop through our data set until we find the entry that equals 349 but our data set could be big and this could be very time consuming. The alternative is to use the Binary Search which allows us to find the index of our typo in a sorted lists quickly and easily with far less comparisons.
The algorithm itself is fairly straightforward and easy to implement. The list must be sorted from lowest value at index 0 to highest value at the final index. We then have a target value we wish to find which we call $T$ and the algorithm essentially compares $T$ to the middle value, $m$ of the list. The half of the list in which $T$ cannot be found is then discarded and the process repeated until $m = T$.
More succinctly the algorithm is given as,
- Define $L$ to be index 0 and $H$ to be the final index in the list.
- Define $m$ to be the floor division, $\frac{L + H}{2}$ such that $m$ is an integer index at the centre of the list.
- Compare the values of $m$ and $T$ the target. If $T < m$ then set $H = m -1$ and if $T > m$ set $L = m + 1$.
- Repeat from step 2 until $m = T$. If $L > H$ then the search is unsuccessful and $T$ is not to be found in the list.
We can implement this in Python using the following code,
def searching(x, T): L = 0 H = len(x) - 1 while L <= H: m = (L + H)//2 if x[m] < T: L = m + 1 elif x[m] > T: H = m - 1 else: return m, True return None, False
Here $x$ is our sorted list and the function is returning the index, $m$ of the target element, $T$ if it is present in the data set. We also return a boolean argument of 'True' in these instances so that we can use the function for conditional operations that are dependent on the presence of $T$. If $L > H$ then the while loop is exited and we return 'None' for the index and the 'False' as the element $T$ is not present in the list.
We can test that this code works and I do this by generating 100 uniformly distributed random numbers between 0 and 100. I then append my target value of 16 into the list so that I know it is present and sort the list using the Quicksort algorithm I wrote here. I also re-write the above function into a callable class and this can be found on my github linked at the end of the article.
The graph below shows the algorithm in action after passing my sorted list and the target value of 16 to the Binary Search function. The yellow bars show the limiting values, $L$ and $H$ of the search space and the red bar shows the value of $m$. The light shaded blue bars show the elements of the list that are excluded from the search as the algorithm progresses and you can see that the search space is halved at every iteration.
You can perform a Binary Search on any sortable data set. In other words if you can find a consistent way to translate strings into sortable numerical values then a Binary Search can be used to identify if a phrase is present on a webpage for example.
Similarly, modifications can be made to the algorithm to identify if there are any values in the list close to if not equal to the target value, $T$. For example, we could exit the while loop using something like,
np.isclose(m, T, rtol=rtol, atol=atol),to identify an element in the array that is within the stated relative and absolute tolerances of $T$. This is useful when the list entries are high precision but we are only interested in whether or not there is a value that approximates the target.
There is a lot more that can be done with this algorithm and there are a number of improvements to the speed of the algorithm that can be made. The algorithm has a $\mathcal{O}(\log(N))$ time complexity which is itself very quick but it can be laborious when the list we are searching has a large number of entries, $N$.
I hope that this brief introduction to the Binary Search has been interesting. I may follow it up with some alternative search algorithms and/or modifications in the future. The code for the project can be found here.
Thanks for reading!! Comments are welcome!
More interesting reading:
Comments
Post a Comment