How to add a custom distance metric in DBSCAN

When you specify only the eps and min_samples values in DBSCAN, it uses the Euclidean distance by default to compute the distance between points. There are several other predefined options to choose from, such as ‘manhattan’, ‘l1’, ‘l2’, ‘chebyshev’, ‘jaccard’, ‘minkowski’, etc. You can get the list of predefined options here. What each option represents is beyond the scope of this article. A quick Google search, or a look back at your linear algebra notes, should help you understand them.
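As a quick illustration of switching to one of the predefined options, here is a minimal sketch that passes ‘manhattan’ as the metric. The toy points and the eps and min_samples values are made up for this example.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Toy 2-D data (made up for illustration): two tight groups and one far-away point.
points = np.array([
    [0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
    [5.0, 5.0], [5.1, 5.0], [5.0, 5.1],
    [20.0, 20.0],
])

# Distances are now measured with the Manhattan metric instead of the Euclidean one.
labels = DBSCAN(eps=0.5, min_samples=3, metric="manhattan").fit_predict(points)
print(labels)  # the isolated point is labelled -1 (noise)
```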

However, if none of these options satisfies your requirements, you can define a custom function. Let’s see how that can be done.

You essentially define a function that takes two N-dimensional vectors as input and returns the distance between them. N is the number of dimensions in your input data, so it will be 2 for two-dimensional data, 3 for three-dimensional data, and so on. Let’s assume that we are dealing with 3-dimensional data, and we want to give a higher weight to the distance along the 3rd dimension. Thus, we construct a new function as follows:

import numpy as np

def custom_dist(a, b):
    return np.sqrt((a[0] - b[0])**2 + (a[1] - b[1])**2 + (a[2] - b[2])**6)

As you can see, we have defined a function very similar to the Euclidean distance function, except that the difference in the third dimension is raised to the power 6 instead of 2. In a sense, we are penalizing points more for being far apart in the third dimension than in the first two. Note that this holds only when the difference in the third dimension is greater than 1; a difference smaller than 1 shrinks when raised to a higher power. If you have more dimensions in your data, you just need to accommodate them in the function.
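To get a feel for how this function behaves, here is a small sketch that compares it with the plain Euclidean distance for two pairs of points. The points are made up for illustration.

```python
import numpy as np

def custom_dist(a, b):
    # Euclidean-like distance, with the third dimension's difference raised to the power 6
    return np.sqrt((a[0] - b[0])**2 + (a[1] - b[1])**2 + (a[2] - b[2])**6)

origin = np.array([0.0, 0.0, 0.0])
far_in_z = np.array([0.0, 0.0, 2.0])    # differs by 2 in the third dimension
near_in_z = np.array([0.0, 0.0, 0.5])   # differs by 0.5 in the third dimension

# Custom distance next to the Euclidean distance for each pair
print(custom_dist(origin, far_in_z), np.linalg.norm(origin - far_in_z))    # 8.0 vs 2.0
print(custom_dist(origin, near_in_z), np.linalg.norm(origin - near_in_z))  # 0.125 vs 0.5
```

A difference of 2 in the third dimension is stretched to 8, while a difference of 0.5 is shrunk to 0.125, so the scale of your data matters when you pick the power.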

Now, to use this function as the metric in DBSCAN, simply pass it in the metric argument.

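from sklearn.cluster import DBSCAN

# X, Y and Z are assumed to be 1-D arrays holding the coordinates of the points
data = np.array([X, Y, Z]).T
db_out = DBSCAN(eps=0.02, min_samples=4, metric=custom_dist).fit(data)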
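If you want to try this end to end, here is a self-contained sketch. The X, Y and Z arrays and the eps value are hypothetical and chosen only so that the result is easy to read: the two groups overlap in the first two dimensions and differ only in the third.

```python
import numpy as np
from sklearn.cluster import DBSCAN

def custom_dist(a, b):
    # Third dimension's difference is raised to the power 6 instead of 2
    return np.sqrt((a[0] - b[0])**2 + (a[1] - b[1])**2 + (a[2] - b[2])**6)

# Hypothetical coordinates: two groups of four points, separated only along Z
X = np.array([0.0, 0.1, 0.0, 0.1, 0.0, 0.1, 0.0, 0.1])
Y = np.array([0.0, 0.0, 0.1, 0.1, 0.0, 0.0, 0.1, 0.1])
Z = np.array([0.0, 0.0, 0.0, 0.0, 3.0, 3.0, 3.0, 3.0])

data = np.array([X, Y, Z]).T  # shape (8, 3): one row per point
db_out = DBSCAN(eps=0.5, min_samples=4, metric=custom_dist).fit(data)

# labels_ holds the cluster index of every point (-1 would mean noise)
print(db_out.labels_)
```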

If you need to pass any extra parameters to the custom function, you can use the metric_params argument. It takes a dict containing all the extra arguments, which are passed to the function as keyword arguments. For example, if we wish to provide weights for the 3 dimensions, here’s how that can be done:

import numpy as np
from sklearn.cluster import DBSCAN

def custom_dist(a, b, w1, w2, w3):
    # w1, w2 and w3 are supplied through metric_params
    return np.sqrt(w1*(a[0] - b[0])**2 + w2*(a[1] - b[1])**2 + w3*(a[2] - b[2])**6)

data = np.array([X, Y, Z]).T
db_out = DBSCAN(eps=100, min_samples=1, metric=custom_dist,
                metric_params={'w1': 1, 'w2': 5, 'w3': 4}).fit(data)

In case you wish to check if this works, you can set all the weights to 1, and change the power of the third dimension to 2 instead of 6. This way, the above function will essentially become the euclidean distance function, and you should get the same output that you’d have gotten had you not passed in the metric and metric_params arguments.
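Here is a minimal sketch of that check. The random data, the eps and min_samples values, and the function name weighted_dist are assumptions made for this example.

```python
import numpy as np
from sklearn.cluster import DBSCAN

def weighted_dist(a, b, w1, w2, w3):
    # With all weights set to 1 and every power set to 2, this is the Euclidean distance
    return np.sqrt(w1*(a[0] - b[0])**2 + w2*(a[1] - b[1])**2 + w3*(a[2] - b[2])**2)

# Made-up 3-dimensional data, seeded so that the run is repeatable
rng = np.random.default_rng(0)
data = rng.normal(size=(60, 3))

custom = DBSCAN(eps=0.8, min_samples=4, metric=weighted_dist,
                metric_params={'w1': 1, 'w2': 1, 'w3': 1}).fit(data)
default = DBSCAN(eps=0.8, min_samples=4).fit(data)

# Both runs should assign the same label to every point
print(np.array_equal(custom.labels_, default.labels_))
```

Keep in mind that a Python function is called once for every pair of points that is compared, so a custom metric will typically run slower than the built-in ones on large datasets.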

That’s it. I hope you liked this tutorial. You can check out other Python tutorials on iotespresso.com.

If you are interested in data science and machine learning using Python, you may find this course by Jose Portilla on Udemy to be very helpful. It has been the foundation course in Python for me and several of my colleagues.
