James Brennan

A blog about statistics, data science, and remote sensing.

16 Mar 2020

Fast and easy gridding of point data with geopandas

This post looks at using the geopandas library to do fast efficient gridding of point data onto a regular grid. Geopandas is a python package that provides a geospatial extension to pandas – so that dataframes can store geographic data such as points and polygons.

Data

A simple csv of point data provides a useful starting point for this. dob.csv contains locations of wildfires detected by the NASA MODIS instruments for a year over North and South America.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
df = pd.read_csv("dob.csv", index_col=0)
df = df.sort_values('y')
df = df.drop(columns="geometry")
df.head()
x y dob
111 -5.4247e+06 -4.38456e+06 206
110 -5.42516e+06 -4.38456e+06 206
109 -5.42238e+06 -4.38317e+06 206
108 -5.42284e+06 -4.38317e+06 206
107 -5.42794e+06 -4.38317e+06 199

So as expected the data is essentially a collection of around 230-thousand x and y points in a projected sinusoidal projection (the MODIS projection) and the day of the year (dob) that the fire occurred.

geopandas

Given our spatial data is point data with x and y coordinates it’s simple to convert the df into a geopandas GeoDataFrame:

import geopandas
gdf = geopandas.GeoDataFrame(df, 
            geometry=geopandas.points_from_xy(df.x, df.y),
            crs="+proj=sinu +lon_0=0 +x_0=0 +y_0=0 +a=6371007.181 +b=6371007.181 +units=m +no_defs")
gdf.head()
x y dob geometry
111 -5.4247e+06 -4.38456e+06 206 POINT (-5424696.941458113 -4384559.892867883)
110 -5.42516e+06 -4.38456e+06 206 POINT (-5425160.25417464 -4384559.892867883)
109 -5.42238e+06 -4.38317e+06 206 POINT (-5422380.377875471 -4383169.954718296)
108 -5.42284e+06 -4.38317e+06 206 POINT (-5422843.690592 -4383169.954718296)
107 -5.42794e+06 -4.38317e+06 199 POINT (-5427940.130473807 -4383169.954718296)

So geopandas has taken the x and y fields and made simple point geometries. We don’t need the x and y fields anymore so let’s drop those and then use geopandas to plot the points:

gdf = gdf.drop(columns=['x', 'y'])
gdf.tail()
dob geometry
6 197 POINT (-5506239.979567026 6673324.712505849)
5 197 POINT (-5506703.292283554 6673324.712505849)
4 205 POINT (-5507166.605000081 6673324.712505849)
1 191 POINT (-5510409.794015776 6673324.712505849)
0 205 POINT (-5504850.041417442 6674714.650655435)

Like pandas, geopandas has built in plotting routines:

gdf.plot(markersize=.1, figsize=(8, 8))

png

So that doesn’t make a lot of sense yet although geopandas has correctly plotted a marker for each point using the geometry column of the dataframe. Let’s make a more useful plot of the data:

ax = gdf.plot(markersize=.1, figsize=(12, 8), column='dob', cmap='jet')
plt.autoscale(False)
world = geopandas.read_file(geopandas.datasets.get_path('naturalearth_lowres'))
world.to_crs(gdf.crs).plot(ax=ax, color='none', edgecolor='black')
ax.axis('off')

png

That looks better! A nice feature of geopandas is that it has built on-the-fly reprojection – we’ve taken the world geodataframe and reprojected it from WGS84 to the MODIS sinusoidal projection that the fire data is in with the coordinate reference systems. world.to_crs(gdf.crs)

Gridding with geopandas

Turning to the gridding, geopandas has operations for merging geometries and aggregating statistics to overlapping geometries in ways that will be familiar to pandas users.

First, let’s make the grid that the data is being binned into. To make a regular rectangular mesh for the binned data we can loop over the extent of the data and use shapely to produce each individual grid cell as a geometric polygon:

# total area for the grid
xmin, ymin, xmax, ymax= gdf.total_bounds
# how many cells across and down
n_cells=30
cell_size = (xmax-xmin)/n_cells
# projection of the grid
crs = "+proj=sinu +lon_0=0 +x_0=0 +y_0=0 +a=6371007.181 +b=6371007.181 +units=m +no_defs"
# create the cells in a loop
grid_cells = []
for x0 in np.arange(xmin, xmax+cell_size, cell_size ):
    for y0 in np.arange(ymin, ymax+cell_size, cell_size):
        # bounds
        x1 = x0-cell_size
        y1 = y0+cell_size
        grid_cells.append( shapely.geometry.box(x0, y0, x1, y1)  )
cell = geopandas.GeoDataFrame(grid_cells, columns=['geometry'], 
                                 crs=crs)

Plotting the grid over the data shows how the fires are generally clustered into a few regions and we’ll end up with many “empty” grid cells:

ax = gdf.plot(markersize=.1, figsize=(12, 8), column='dob', cmap='jet')
plt.autoscale(False)
cell.plot(ax=ax, facecolor="none", edgecolor='grey')
ax.axis("off")

png

geopandas sjoin and dissolve

Geopandas can then combine geodataframes based on the geometries of each row, in a manner similar to how vanilla pandas merges based on column values.

sjoin is used to merge the two dataframes:

merged = geopandas.sjoin(gdf, cell, how='left', op='within')

so gdf and cell will be merged according to the geometries in cell – eg each grid using the operator rule within – or each row of gdf will be associated with the grid cell that its location falls within.

Now we want to count the number of fires in each grid cell. This can be achieved by aggregating merged so that the number of rows for each grid cell is counted. In geopandas this aggregation process is referred to as dissolving – i.e. making the data more sparse:

# make a simple count variable that we can sum
merged['n_fires']=1
# Compute stats per grid cell -- aggregate fires to grid cells with dissolve
dissolve = merged.dissolve(by="index_right", aggfunc="count")
# put this into cell
cell.loc[dissolve.index, 'n_fires'] = dissolve.n_fires.values

Then finally let’s plot the grid and the number of fires in each grid cell with geopandas:

ax = cell.plot(column='n_fires', figsize=(12, 8), cmap='viridis', vmax=5000, edgecolor="grey")
plt.autoscale(False)
world = geopandas.read_file(geopandas.datasets.get_path('naturalearth_lowres'))
world.to_crs(cell.crs).plot(ax=ax, color='none', edgecolor='black')
ax.axis('off')

png

r-trees and fast spatial joins

The reason that spatial joins in geopandas are fast is that geopandas leverages a useful data partitioning algorithm – the R-tree. R-trees are data structures which index spatial data in a way that makes identifying data from a spatial region faster than a brute force search approach. An R-tree collects points/geometries nearby each other into a tree-like structure of bounding boxes that define a spatial area that is then a quick method for spatial indexing.

In geopanads, the r-tree is built when the attribute gdf.sindex is accessed:

r_tree = gdf.sindex
print(r_tree)
rtree.index.Index(bounds=[-11105837.471524423, -4384559.892867883, -4067190.682031316, 6674714.650655435], size=226719)

The r-tree has a leaves method which shows the spatial partitioning structure of each leaf of the r-tree:

for leaf in r_tree.leaves()[:2]:
    idxs, indices, bbox = leaf
    print(f'-> points in box {idxs}: ',  indices, '\n bounding box: ', bbox, '\n')
print(f"number of leaves: {len(r_tree.leaves())}")
-> points in box 6:  [182654, 182656, 182665, 182663, 182659, 182664, 182655, 182657, 182660, 182652, 182653, 182675, 182677, 182671, 182670, 182674, 182680, 182678, 182673, 182672, 182669, 182668, 182679, 182676, 182686, 182681, 182684, 182687, 182690, 182688, 182685, 182682, 182689, 182683, 182692, 182696, 182693, 182699, 182695, 182697, 182694, 182691, 182698, 182703, 182702, 182706, 182704, 182707, 182701, 182705, 182700, 182710, 182711, 182709, 182708, 182712, 182713, 182714, 182715, 182716, 182718, 182717, 182719, 182721, 182720, 182722, 182728, 182725, 182724, 182729] 
 bounding box:  [-10882520.74215797, 2750455.9416639777, -10852868.728300184, 2759722.1959945355] 

-> points in box 7:  [182730, 182723, 182727, 182726, 182741, 182733, 182731, 182740, 182739, 182742, 182734, 182732, 182735, 182738, 182736, 182737, 182743, 182756, 182744, 182754, 182746, 182749, 182750, 182747, 182748, 182753, 182752, 182751, 182755, 182745, 182757, 182758, 182760, 182759, 182762, 182761, 182763, 182764, 182767, 182769, 182766, 182768, 182770, 182765, 182776, 182777, 182774, 182772, 182771, 182775, 182773, 182779, 182781, 182780, 182778, 182782, 182784, 182788, 182786, 182785, 182789, 182787, 182783, 182795, 182797, 182791, 182790, 182794, 182793, 182792] 
 bounding box:  [-10869084.67337866, 2759722.1959945355, -10855648.604599353, 2763892.010443287] 

number of leaves: 3239

Plotting the first 200 bounding boxes shows how the data is partitioned geographically:

import shapely.geometry
from descartes import PolygonPatch
ax = gdf.plot(markersize=.1, figsize=(12, 8), column='dob', cmap='jet')
plt.autoscale(False)
for leaf in r_tree.leaves()[:200]:
    idxs, indices, bbox = leaf
    r_box = shapely.geometry.box(*bbox)
    box_patch = PolygonPatch(r_box, fc='None', ec='k')
    ax.add_patch(box_patch)
world.to_crs(gdf.crs).plot(ax=ax, color='none', edgecolor='grey')
ax.axis('off')

png