James Brennan

A blog about statistics, data science, and remote sensing.

13 May 2020

Linking python with PostGIS

GIS databases are sort of a necessary evil. Once your data gets too big for RAM you’ve got to start thinking either about relational databases or doing something fancy with dask. PostGIS is a spatial database extender for PostgreSQL object-relational database. In essence, PostGIS puts GIS functions into SQL queries allowing you to run queries and joins based on location. PostgreSQL is big nasty and powerful so this post provides a basic look at utilising PostGIS for the ‘GIS stuff’ and linking python to a PostGIS database for pre-processing and visualisation.

We’ll use the excellent python package osmnx to load some building and land use data from OpenStreetMap.

import osmnx as ox
import matplotlib.pyplot as plt
import geopandas as gp
# get some data
london = (51.501204, 0.001922)
distance=2000
buildings = ox.footprints.footprints_from_point(london, distance=distance)
landuse = ox.footprints.footprints_from_point(london, distance=distance, 
 footprint_type='landuse')

osmnx provides the data back to us in geopandas geodataframes. Let’s make a plot of the data:

fig, ax = plt.subplots(figsize=(15, 15))
landuse.plot(column='landuse', ax=ax)
buildings.plot(ax=ax, color='None',lw=1, alpha=1)
ax.axis('off')

png

PostGIS

We’ll now dive into Postgres and the PostGIS extension. We can use psql command-line utility for Postgres to execute SQL commands from the terminal to create the database and load the PostGIS extension for Postgres.

➜ psql -U postgres -c "CREATE DATABASE london;"
➜ psql -U postgres -d london -c "CREATE EXTENSION postgis;" 

To get the dataframes into the london database I’m going to save them as shapefiles and use the shp2pgsql command-line utility.

(buildings['geometry']).to_file("buildings.shp")
(landuse[['landuse', 'geometry']]).to_file("landuse.shp")
➜ hp2pgsql buildings.shp buildings | psql -U postgres -d london

Looking at the command we can see that shp2pgsql produces a set of SQL instructions which we then pipe to psql to do the actual insertion. If we change the pipe to head we can see the SQL commands that shp2pgsql has generated for us:

➜ shp2pgsql buildings.shp buildings | head


Field fid is an FTDouble with width 11 and precision 0
Shapefile type: Polygon
Postgis type: MULTIPOLYGON[2]
SET CLIENT_ENCODING TO UTF8;
SET STANDARD_CONFORMING_STRINGS TO ON;
BEGIN;
CREATE TABLE "buildings" (gid serial,
"fid" float8);
ALTER TABLE "buildings" ADD PRIMARY KEY (gid);
SELECT AddGeometryColumn('','buildings','geom','0','MULTIPOLYGON',2);
INSERT INTO "buildings" ("fid",geom) VALUES ('0','0106000000010000000103000000010000001300000099709EA0038194BFA40DD1329CC0494002E66BE0586194BFEAFE5657AAC04940010BAA57DB5E94BFAEB25B70ABC04940F3A15577764394BFFC219111ABC049405131CEDF844294BF67ECF07CABC049406B9505B8C5A193BFC742194FA9C049400D068D4FB7A293BFC7B8E2E2A8C04940FF9C386F528793BF15281884A8C049400FE8F120E28B93BFE0EC7B79A6C0494020F8CE1E1E7893BF348FB234A6C04940EBFB15D79D9393BF7CAE00F099C04940DAEB38D961A793BF934CF3339AC04940EA36F28AF1AB93BFB20FB22C98C04940F89F466B56C793BF65A07C8B98C049409A10CE0248C893BFFAD51C2098C049407FAC962A076994BF2E3FCB4E9AC04940DD3B0F93156894BF99092BBA9AC04940EBA463737A8394BF4B9AF5189BC0494099709EA0038194BFA40DD1329CC04940');
INSERT INTO "buildings" ("fid",geom) VALUES ('1','0106000000010000000103000000010000000E0000006778584D7C5A92BFFC4FA335ABC04940E2E7BF07AF5D92BF677627E9ABC04940D08E650B523A92BFB2666490BBC04940C043AC59C23592BF8E76DCF0BBC04940CB64DDE45C2F92BF829A1029BCC049408BBBE6FAE36291BF953F845DB9C049407A956BC0D65B91BFAD252E11B9C0494051DEC7D11C5991BF66A77A8DB8C04940BDDD37633D5A91BF7227220DB8C04940A69C2FF65E7C91BFD3C2C0CEA8C049403C534376398091BF3F4BA13DA8C04940FD3CFCEBCB8891BFC8A4750AA8C0494056945C0F705292BFEA888DC3AAC049406778584D7C5A92BFFC4FA335ABC04940');
INSERT INTO "buildings" ("fid",geom) VALUES ('2','01060000000100000001030000000100000005000000158C4AEA043491BFAB35DE67A6C0494093E4B9BE0F0791BF83723678BAC049408647D1B9916890BF1F85EB51B8C049408341881A3B9790BF59857247A4C04940158C4AEA043491BFAB35DE67A6C04940');

Lets repeat this process for the landuse.shp file:

➜ shp2pgsql landuse.shp landuse | psql -U postgres -d london

And check that the tables have actually been added to the database:

➜ psql -U postgres -d london -c "\dt"

 List of relations
 Schema | Name | Type | Owner 
--------+--------------------+-------+----------
 public | buildings | table | postgres
 public | landuse | table | postgres
 public | spatial_ref_sys | table | postgres
(3 rows)

Doing GISy things with PostGIS

Lets now use PostGIS to do some rudimentary GIS analysis. We’re going to look at access to green spaces for the neighbourhood (very roughly!).

The landuse dataframe defines an associated land use for each polygon. We’re interested only in polygons with an associated grass land use class. Let’s visualise where the grassy areas defined in landuse are:

fig, ax = plt.subplots(figsize=(15, 15))
(landuse[landuse['landuse']=='grass']).plot(column='landuse', ax=ax, color='g')
buildings.plot(ax=ax, color='None',lw=1,alpha=1)
ax.axis('off')

png

Now, we’ll use PostGIS to do some GISy things which we can then visualise later in python. First, we’ll consider which of the buildings are within 50m of a grassy area. We can construct an SQL query to do this making sure to limit the analysis to only land use polygons where landuse field (landuse.landuse) is grass:

SELECT 
 DISTINCT buildings.fid,
 buildings.geom, 
 landuse.landuse
FROM 
 buildings, landuse
WHERE 
 ST_DWithin(buildings.geom::geography, landuse.geom::geography, 50)
 AND landuse.landuse='grass';

So this is a fairly standard looking SQL query - we select fields we need from the two tables buildings and landuse with some constraints as provided by the WHERE. ST_DWithin within is the PostGIS function doing all the work here. It returns true if the geometries are within the specified distance of one another.

The use of the geography keyword buildings.geom::geography is needed because our data is in WGS84 (i.e. latitude/longitude) and we want to do the calculation based on metres. This, of course, means reprojecting the data which is slow so an alternative is to use ST_Distance_Sphere which approximates the calculation in decimal degrees for us. But for good practice and to make the procedure robust to whatever projection the data is in let’s stick with the ::geography transformation here.

Also, the DISTINCT command here is important because it prevents us from storing duplicates of each building (for example when a building is within 50m of more than one grassy geometry).

We can store the query results into a table called within50 by adding an INTO statement:

SELECT 
 DISTINCT buildings.fid,
 buildings.geom, 
 landuse.landuse
INTO within50
FROM 
 buildings, landuse
WHERE 
 ST_DWithin(buildings.geom::geography, landuse.geom::geography, 50)
 AND landuse.landuse='grass';

And let’s check the table has been saved:

london=# \dt
 List of relations
 Schema | Name | Type | Owner 
--------+--------------------+-------+----------
 public | buildings | table | postgres
 public | landuse | table | postgres
 public | spatial_ref_sys | table | postgres
 public | within50 | table | postgres
(4 rows)

We might wish to generalise this task a bit. How about finding the distance to the nearest grassy area for each building? This an example of a nearest neighbour search. There are several different ways to do a NN search with PostGIS - though unfortunately all of them are a bit involved. I’ve chosen one that hopefully makes some sense:

SELECT buildings.fid,
 ST_Distance(buildings.geom::geography, land.geom::geography) AS dist,
 buildings.geom
INTO buildingsDistances
FROM buildings
JOIN LATERAL (
 SELECT geom, landuse
 FROM landuse
 WHERE ST_DWithin(buildings.geom::geography, geom::geography, 2000)
 AND landuse = 'grass'
 ORDER BY buildings.geom <-> geom
 LIMIT 1
) AS land ON true;

Breaking this down a bit makes the query more digestible. Key to this is query is the LATERAL JOIN method which essentially allows for a join in which the rows of one table are iterated over (as if with a for loop) and the rows matching conditions are then used for the join. The lateral join used here performs the nearest neighbour query and aliases the result as land:

(
 SELECT geom, landuse
 FROM landuse
 WHERE ST_DWithin(buildings.geom::geography, geom::geography, 2000)
 AND landuse = 'grass'
 ORDER BY buildings.geom <-> geom
 LIMIT 1
)

Let’s break this subquery down a bit further. First, we limit the geometries within 2000m of each other. This helps to limit the amount of spatial indexing we’ll do, an efficiency hack essentially. The nearest neighbour search is then effectively done by the following two lines:

 ORDER BY buildings.geom <-> geom
 LIMIT 1

buildings.geom <-> geom is the PostGIS way (termed a “index-based distance operator” of referring to the distance between polygon bounding boxes. This is NOT the same as the true distance between the polygons because it’s based on the PostGIS spatial indexing R-tree (so it’s approximate!). This is what provides the necessary speed to an efficient nearest neighbour search. Then the LIMIT 1 provides only the nearest geometry… nice!

Let’s look at buildingsDistances:

london=# SELECT fid, dist FROM buildingsDistances LIMIT 10;
 fid | dist 
-----+--------------
 0 | 539.95580113
 1 | 315.44344397
 2 | 164.43032264
 3 | 4.94973921
 4 | 237.93579893
 5 | 16.53049982
 6 | 79.73473226
 7 | 353.70800526
 8 | 72.46755079
 9 | 231.41497931
(10 rows)

Connecting to PostGIS in python

To create a connection to our PostGIS database from python we can use the versatile sqlalchemy toolkit.

from sqlalchemy import create_engine
db_string = "postgres://postgres:123@localhost:5432/london"
db_connection = create_engine(db_string)

We can execute standard SQL queries which provide back an iterator for the rows of the query:

# check the connection
c = db_connection.execute("SELECT * FROM buildings;")
c.next()

(1, 0.0, ‘0106000000010000000103000077 … (354 characters truncated) … 893BFFAD5CC04940’)

The long string in there is the binary representation of the polygon. Geopandas can connect to our PostGIS database via the from_postgis function with our database connection db_connection:

within50 = gp.GeoDataFrame.from_postgis("SELECT * FROM within50;",
 db_connection,
 geom_col='geom', 
 index_col='fid', 
 coerce_float=True)

Let’s visualise buildings within 50m of a grassy geometry:

fig, ax = plt.subplots(figsize=(15, 15))
within50.plot(ax=ax, color='None', alpha=1)
(landuse[landuse['landuse']=='grass']).plot(column='landuse', ax=ax, color='g')
ax.axis('off')

png

And now let’s visualise the distance for each building:

from mpl_toolkits.axes_grid1 import make_axes_locatable 
distances = gp.GeoDataFrame.from_postgis("SELECT * FROM buildingsDistances;", db_connection, geom_col='geom', crs=None, index_col='fid')

fig, ax = plt.subplots(figsize=(15, 15))
divider = make_axes_locatable(ax)
cax = divider.append_axes("right", size="5%", pad=0.1)
im = distances.plot(column='dist', 
 cmap='viridis',
 ax=ax, legend=True, 
 cax=cax)
(landuse[landuse['landuse']=='grass']).plot(column='landuse',
 ax=ax, color='g')

png

Further reading