If you have a list or set of points, and wish to find out if they lie within a given Polygon, the first option that may come to your mind may be Shapely. And there are some efficient ways of calculating this using Shapely. Let’s explore them one by one.
We have about 690K locations scattered across India. We will create a Polygon connecting the 4 major cities of India (Mumbai, Delhi, Kolkata and Chennai), and see how many points lie within this Polygon.
Thus, the code for the setup is as follows. In order to replicate this at your end, you will need the lat_lon_dump.csv file, which can be found in the zipped format here.
import pandas as pd from shapely.geometry import Point, MultiPoint, Polygon import time df = pd.read_csv('lat_lon_dump.csv') points = [(row['longitude'],row['latitude']) for ind, row in df.iterrows()] points = list(set(points)) #remove duplicate points polygon_corners = [(72.8777,19.0760),(77.1025,28.7041),(88.3639,22.5726),(80.2707,13.0827),(72.8777,19.0760)] #Mumbai, Delhi, Kolkata, Chennai p = Polygon(polygon_corners)
The naive approach
The naive approach will consist of a for loop, wherein each point will be checked. The points that lie within the polygon will be stored in a separate array.
points_within_p_1 =  t1 = time.time() for point in points: if(Point(point).intersects(p)): points_within_p_1.append(point) t2 = time.time() print(t2-t1)
This takes about 20.12 seconds.
The MultiPoint approach
A slightly better approach would be to convert the set of points into a MultiPoint object and then find its intersection with the polygon.
t1 = time.time() p2 = MultiPoint(points) points_within_p_2 = list(map(lambda point: (point.x,point.y), p2.intersection(p))) t2 = time.time() print(t2-t1)
This approach takes 13.24 seconds, which is about 65% of the time taken by the naive approach. Please note that the map function (converting the MultiPoint object returned by the intersection back to a list of points), takes considerable amount of time. If you were to not do the back-conversion, and just work with the MultiPoint output, as shown below:
t1 = time.time() p2 = MultiPoint(points) points_within_p_2 = p2.intersection(p) t2 = time.time() print(t2-t1)
then the process would take just 6.6 seconds, which is approx. 30% of the time taken by the naive approach.
Matplotlib Path approach
While the shapely MultiPoint approach does give us a significant time reduction, it is not the winner. We have an unexpected winner here: matplotlib path. Yes! Who would have imagined matplotlib to beat shapely here. But it does. Have a look at the implementation below:
from matplotlib.path import Path import numpy as np t1 = time.time() path_p = Path(p.boundary) inside_points = path_p.contains_points(points) points_within_p_3 = np.array(points)[inside_points] t2 = time.time() print(t2-t1)
This takes just 2.01 seconds. This is about 10% of the time taken by the naive approach. And what’s more, it even preserves the order of the points (something which MultiPoint doesn’t).
Needless to say, the output from all the 3 approaches above is exactly the same. A quick sanity can be done by printing the lengths of the 3 arrays, which is exactly the same: 262141.
I hope you enjoyed this article, and it helps you increase the speed of your code. For more tutorials on Python, check out: https://iotespresso.com/category/python/