Linking python with PostGIS
GIS databases are sort of a necessary evil. Once your data gets too big for RAM you’ve got to start thinking either about relational databases or doing something fancy with dask. PostGIS is a spatial database extender for PostgreSQL object-relational database. In essence, PostGIS puts GIS functions into SQL queries allowing you to run queries and joins based on location. PostgreSQL is big nasty and powerful so this post provides a basic look at utilising PostGIS for the ‘GIS stuff’ and linking python to a PostGIS database for pre-processing and visualisation.
We’ll use the excellent python package osmnx to load some building and land use data from OpenStreetMap.
import osmnx as ox import matplotlib.pyplot as plt import geopandas as gp # get some data london = (51.501204, 0.001922) distance=2000 buildings = ox.footprints.footprints_from_point(london, distance=distance) landuse = ox.footprints.footprints_from_point(london, distance=distance, footprint_type='landuse')
osmnx provides the data back to us in geopandas geodataframes. Let’s make a plot of the data:
fig, ax = plt.subplots(figsize=(15, 15)) landuse.plot(column='landuse', ax=ax) buildings.plot(ax=ax, color='None',lw=1, alpha=1) ax.axis('off')
We’ll now dive into Postgres and the PostGIS extension. We can use
psql command-line utility for Postgres to execute SQL commands from the terminal to create the database and load the PostGIS extension for Postgres.
➜ psql -U postgres -c "CREATE DATABASE london;" ➜ psql -U postgres -d london -c "CREATE EXTENSION postgis;"
To get the dataframes into the
london database I’m going to save them as shapefiles and use the
shp2pgsql command-line utility.
(buildings['geometry']).to_file("buildings.shp") (landuse[['landuse', 'geometry']]).to_file("landuse.shp")
➜ hp2pgsql buildings.shp buildings | psql -U postgres -d london
Looking at the command we can see that
shp2pgsql produces a set of SQL instructions which we then pipe to
psql to do the actual insertion. If we change the pipe to
head we can see the SQL commands that
shp2pgsql has generated for us:
➜ shp2pgsql buildings.shp buildings | head Field fid is an FTDouble with width 11 and precision 0 Shapefile type: Polygon Postgis type: MULTIPOLYGON SET CLIENT_ENCODING TO UTF8; SET STANDARD_CONFORMING_STRINGS TO ON; BEGIN; CREATE TABLE "buildings" (gid serial, "fid" float8); ALTER TABLE "buildings" ADD PRIMARY KEY (gid); SELECT AddGeometryColumn('','buildings','geom','0','MULTIPOLYGON',2); INSERT INTO "buildings" ("fid",geom) VALUES ('0','0106000000010000000103000000010000001300000099709EA0038194BFA40DD1329CC0494002E66BE0586194BFEAFE5657AAC04940010BAA57DB5E94BFAEB25B70ABC04940F3A15577764394BFFC219111ABC049405131CEDF844294BF67ECF07CABC049406B9505B8C5A193BFC742194FA9C049400D068D4FB7A293BFC7B8E2E2A8C04940FF9C386F528793BF15281884A8C049400FE8F120E28B93BFE0EC7B79A6C0494020F8CE1E1E7893BF348FB234A6C04940EBFB15D79D9393BF7CAE00F099C04940DAEB38D961A793BF934CF3339AC04940EA36F28AF1AB93BFB20FB22C98C04940F89F466B56C793BF65A07C8B98C049409A10CE0248C893BFFAD51C2098C049407FAC962A076994BF2E3FCB4E9AC04940DD3B0F93156894BF99092BBA9AC04940EBA463737A8394BF4B9AF5189BC0494099709EA0038194BFA40DD1329CC04940'); INSERT INTO "buildings" ("fid",geom) VALUES ('1','0106000000010000000103000000010000000E0000006778584D7C5A92BFFC4FA335ABC04940E2E7BF07AF5D92BF677627E9ABC04940D08E650B523A92BFB2666490BBC04940C043AC59C23592BF8E76DCF0BBC04940CB64DDE45C2F92BF829A1029BCC049408BBBE6FAE36291BF953F845DB9C049407A956BC0D65B91BFAD252E11B9C0494051DEC7D11C5991BF66A77A8DB8C04940BDDD37633D5A91BF7227220DB8C04940A69C2FF65E7C91BFD3C2C0CEA8C049403C534376398091BF3F4BA13DA8C04940FD3CFCEBCB8891BFC8A4750AA8C0494056945C0F705292BFEA888DC3AAC049406778584D7C5A92BFFC4FA335ABC04940'); INSERT INTO "buildings" ("fid",geom) VALUES ('2','01060000000100000001030000000100000005000000158C4AEA043491BFAB35DE67A6C0494093E4B9BE0F0791BF83723678BAC049408647D1B9916890BF1F85EB51B8C049408341881A3B9790BF59857247A4C04940158C4AEA043491BFAB35DE67A6C04940');
Lets repeat this process for the
➜ shp2pgsql landuse.shp landuse | psql -U postgres -d london
And check that the tables have actually been added to the database:
➜ psql -U postgres -d london -c "\dt" List of relations Schema | Name | Type | Owner --------+--------------------+-------+---------- public | buildings | table | postgres public | landuse | table | postgres public | spatial_ref_sys | table | postgres (3 rows)
Doing GISy things with PostGIS
Lets now use PostGIS to do some rudimentary GIS analysis. We’re going to look at access to green spaces for the neighbourhood (very roughly!).
landuse dataframe defines an associated land use for each polygon. We’re interested only in polygons with an associated
grass land use class. Let’s visualise where the grassy areas defined in
fig, ax = plt.subplots(figsize=(15, 15)) (landuse[landuse['landuse']=='grass']).plot(column='landuse', ax=ax, color='g') buildings.plot(ax=ax, color='None',lw=1,alpha=1) ax.axis('off')
Now, we’ll use PostGIS to do some GISy things which we can then visualise later in python. First, we’ll consider which of the buildings are within 50m of a grassy area. We can construct an SQL query to do this making sure to limit the analysis to only land use polygons where
landuse field (
SELECT DISTINCT buildings.fid, buildings.geom, landuse.landuse FROM buildings, landuse WHERE ST_DWithin(buildings.geom::geography, landuse.geom::geography, 50) AND landuse.landuse='grass';
So this is a fairly standard looking SQL query - we select fields we need from the two tables
landuse with some constraints as provided by the
ST_DWithin within is the PostGIS function doing all the work here. It returns true if the geometries are within the specified distance of one another.
The use of the geography keyword
buildings.geom::geography is needed because our data is in WGS84 (i.e. latitude/longitude) and we want to do the calculation based on metres. This, of course, means reprojecting the data which is slow so an alternative is to use ST_Distance_Sphere which approximates the calculation in decimal degrees for us. But for good practice and to make the procedure robust to whatever projection the data is in let’s stick with the
::geography transformation here.
DISTINCT command here is important because it prevents us from storing duplicates of each building (for example when a building is within 50m of more than one grassy geometry).
We can store the query results into a table called
within50 by adding an
SELECT DISTINCT buildings.fid, buildings.geom, landuse.landuse INTO within50 FROM buildings, landuse WHERE ST_DWithin(buildings.geom::geography, landuse.geom::geography, 50) AND landuse.landuse='grass';
And let’s check the table has been saved:
london=# \dt List of relations Schema | Name | Type | Owner --------+--------------------+-------+---------- public | buildings | table | postgres public | landuse | table | postgres public | spatial_ref_sys | table | postgres public | within50 | table | postgres (4 rows)
We might wish to generalise this task a bit. How about finding the distance to the nearest grassy area for each building? This an example of a nearest neighbour search. There are several different ways to do a NN search with PostGIS - though unfortunately all of them are a bit involved. I’ve chosen one that hopefully makes some sense:
SELECT buildings.fid, ST_Distance(buildings.geom::geography, land.geom::geography) AS dist, buildings.geom INTO buildingsDistances FROM buildings JOIN LATERAL ( SELECT geom, landuse FROM landuse WHERE ST_DWithin(buildings.geom::geography, geom::geography, 2000) AND landuse = 'grass' ORDER BY buildings.geom <-> geom LIMIT 1 ) AS land ON true;
Breaking this down a bit makes the query more digestible. Key to this is query is the LATERAL JOIN method which essentially allows for a join in which the rows of one table are iterated over (as if with a for loop) and the rows matching conditions are then used for the join. The lateral join used here performs the nearest neighbour query and aliases the result as
( SELECT geom, landuse FROM landuse WHERE ST_DWithin(buildings.geom::geography, geom::geography, 2000) AND landuse = 'grass' ORDER BY buildings.geom <-> geom LIMIT 1 )
Let’s break this subquery down a bit further. First, we limit the geometries within 2000m of each other. This helps to limit the amount of spatial indexing we’ll do, an efficiency hack essentially. The nearest neighbour search is then effectively done by the following two lines:
ORDER BY buildings.geom <-> geom LIMIT 1
buildings.geom <-> geom is the PostGIS way (termed a “index-based distance operator” of referring to the distance between polygon bounding boxes. This is NOT the same as the true distance between the polygons because it’s based on the PostGIS spatial indexing R-tree (so it’s approximate!). This is what provides the necessary speed to an efficient nearest neighbour search. Then the
LIMIT 1 provides only the nearest geometry… nice!
Let’s look at
london=# SELECT fid, dist FROM buildingsDistances LIMIT 10; fid | dist -----+-------------- 0 | 539.95580113 1 | 315.44344397 2 | 164.43032264 3 | 4.94973921 4 | 237.93579893 5 | 16.53049982 6 | 79.73473226 7 | 353.70800526 8 | 72.46755079 9 | 231.41497931 (10 rows)
Connecting to PostGIS in python
To create a connection to our PostGIS database from python we can use the versatile sqlalchemy toolkit.
from sqlalchemy import create_engine db_string = "postgres://postgres:123@localhost:5432/london" db_connection = create_engine(db_string)
We can execute standard SQL queries which provide back an iterator for the rows of the query:
# check the connection c = db_connection.execute("SELECT * FROM buildings;") c.next()
(1, 0.0, ‘0106000000010000000103000077 … (354 characters truncated) … 893BFFAD5CC04940’)
The long string in there is the binary representation of the polygon. Geopandas can connect to our PostGIS database via the
from_postgis function with our database connection
within50 = gp.GeoDataFrame.from_postgis("SELECT * FROM within50;", db_connection, geom_col='geom', index_col='fid', coerce_float=True)
Let’s visualise buildings within 50m of a grassy geometry:
fig, ax = plt.subplots(figsize=(15, 15)) within50.plot(ax=ax, color='None', alpha=1) (landuse[landuse['landuse']=='grass']).plot(column='landuse', ax=ax, color='g') ax.axis('off')
And now let’s visualise the distance for each building:
from mpl_toolkits.axes_grid1 import make_axes_locatable distances = gp.GeoDataFrame.from_postgis("SELECT * FROM buildingsDistances;", db_connection, geom_col='geom', crs=None, index_col='fid') fig, ax = plt.subplots(figsize=(15, 15)) divider = make_axes_locatable(ax) cax = divider.append_axes("right", size="5%", pad=0.1) im = distances.plot(column='dist', cmap='viridis', ax=ax, legend=True, cax=cax) (landuse[landuse['landuse']=='grass']).plot(column='landuse', ax=ax, color='g')
- The official PostGIS workshop is fantastic
- The PostGIS cheat sheet is also pretty good