When working with spatial data, one is rarely interested in working with only one source of data. This tutorial will introduce a set of tools for linking vector data with other data sources. It begins by introducing how to link spatial vector data with non-spatial data from in table format, then turns to the problem of linking multiple sources of spatial data through spatial joins and intersects.

This tutorial uses the sp, rgdal, and raster libraries from the RGIS1 tutorial. If you have not yet installed those, please revisit that tutorial for directions. In addition, this tutorial will also make use of the rgeos library, installation of which is discussed in part0_setup.


1. RGEOS Overview

The over command will cover a remarkable number of situations, but it is really only interesting if two shapes intersect exactly. In practice, however, we are often interested in sligthly more flexible questions: how many cities are within 10km of a drone strike? Or how many people live close to a government project?

When we want to do fancer geometric operations, we use the rgeos library. rgeos (which stands for “R interface to the Geometry Engine - Open Source library”) is a set of tools for geometric operations. GEOS a huge library, and one that underlies many spatial tools (not just in R). Whenever you’re thinking about a geometric operation, rgeos is the first place to look.

rgeos tools can be broadly divided into three camps. Here some of the most commonly used ones.

Calculating Properties

  • gArea: calculate area of a shape
  • gLength: calculates length of line / circumference of polygons
  • gDistance: distance between items

Making New Shapes

  • gBuffer: Expand points into circles of given radius
  • gCentroid: Collapse polygons to their centroids
  • gDifference, gUnion, and gIntersection: execute set operations on polygons
  • gUnionCascaded: dissolves a collection of shapes into a single shape.
  • gSimplify: If your polyon is too high a resolution (higher than needed for analysis, and too high for fast processing), reduces the number of vertices while maintaining shape to the best of it’s ability.

Testing Geometric Relationships

  • gIntersects: test if shapes intersect. Primarily useful for testing whether two polygons intersect, since this is not somethingover can do.
  • gContains: Is one spatial object entirely within another?
  • gIsValid: Very useful – make sure your geometries aren’t corrupt!

2. Key RGEOS Concepts

The syntax for these tools varies both across tools and depending on the input types, there are three concepts that come up constantly.

2.1: Units

rgeos isn’t a spatial library – it just does geometry. It takes x-y coordinates and applies geometric formula. Thus the units will always be the units of the x-y coordinates, which come from your projection. So here, we can check the projection to find our units:

proj4string(districts)
[1] "+proj=utm +zone=10 +datum=WGS84 +units=m +no_defs +ellps=WGS84 +towgs84=0,0,0"

The units (+units=m) are meters.

2.2: byid

byid is an option on most rgeos commands. If byid=TRUE, each observation in a Spatial* object is handled separately; if byid=FALSE, a Spatial* object is treated as one big geometry. So if one were working with a SpatialPolygons object of US states, when byid=TRUE, the analysis would be conducted for each state; if byid=FALSE, it would essentially run the analysis against the United States as a whole.

Intuitively, the distinction can often be thought of as the difference between asking whether about a spatial relationship between each observation in the Spatial* object on which the tool is being executed, or any observation in the Spatial* polygon.

To see this illustrated, let’s consider our data we used before on the location of government grants. Suppose we want to identify regions that are within 7km of government grants. We could either ask for the area within 7km of each government project (which would yield a different answer for each grant), or the area within 7km of any government grant (which would yield one large area). You can see this distinction in the following plots:

buffered.grants.byidTRUE <- gBuffer(grants.newproj, width = 7000, byid = TRUE)
buffered.grants.byidFALSE <- gBuffer(grants.newproj, width = 7000, byid = FALSE)

par(mfcol = c(1, 2))
plot(districts, main = "byid TRUE")
plot(buffered.grants.byidTRUE, col = "blue", add = T)

plot(districts, main = "byid FALSE")
plot(buffered.grants.byidFALSE, col = "blue", add = T)

As seen in the figures, when byid=TRUE, a separate buffer polygon is created around each grant, creating a large number of (overlapping) polygons. When byid=FALSE, there are no discrete polygons, just a single feature that covers all points within 7km of any government grant. Indeed, if we look at the summary of these buffers, we see this as well – in the first case, you see that under features, the buffers created with byid=TRUE have 6 distinct features (observations), while the buffers created with byid=FALSE have only 1.

buffered.grants.byidTRUE
class       : SpatialPolygonsDataFrame 
features    : 6 
extent      : 546582.3, 581813.7, 4134618, 4189618  (xmin, xmax, ymin, ymax)
coord. ref. : +proj=utm +zone=10 +datum=WGS84 +units=m +no_defs +ellps=WGS84 +towgs84=0,0,0 
variables   : 3
names       : GrantBudge, JobsCreate,                   ProjectNam 
min values  :     210000,          0, Air Traffic Control Retrofit 
max values  :  120000000,         32,         Wetland Preservation 
buffered.grants.byidFALSE
class       : SpatialPolygons 
features    : 1 
extent      : 546582.3, 581813.7, 4134618, 4189618  (xmin, xmax, ymin, ymax)
coord. ref. : +proj=utm +zone=10 +datum=WGS84 +units=m +no_defs +ellps=WGS84 +towgs84=0,0,0 

2.3: id

The output of almost all rgeos commands will be organized by id. For example, with gArea, each area has a label: 0, 1, and 2. When working with Spatial*DataFrame objects, id will be the rowname for a given observation, just like with tools we’ve worked with before.

To see an object’s ids, you can use the combination of the names command and the geometry command (if you don’t use the geometry command you’ll get column names for objects with associated DataFrames:

names(geometry(districts))
[1] "0" "1" "2"

Things get a little more complicated when rgeos creates new polygons using tools like gBuffer or gIntersect. rgeos tends to try and do smart things, but the behavior does vary across tools, so always be careful.

Here’s an example – let’s create buffers (polygons of fixed radius) around our grants the buffers 7km, so we get back polygons for all points within 7km of each project.

proj4string(grants.newproj)  # check units!
[1] "+proj=utm +zone=10 +datum=WGS84 +units=m +no_defs +ellps=WGS84 +towgs84=0,0,0"
buffered.grants <- gBuffer(grants.newproj, width = 7000, byid = TRUE)
names(geometry(buffered.grants))
[1] "1" "2" "3" "4" "5" "6"
plot(districts)
plot(buffered.grants, col = "blue", add = T)