When working with spatial data, one is rarely interested in working with only one source of data. This tutorial will introduce a set of tools for linking vector data with other data sources. It begins by introducing how to link spatial vector data with non-spatial data from in table format, then turns to the problem of linking multiple sources of spatial data through spatial joins and intersects.

This tutorial uses the sp, rgdal, and raster libraries from the RGIS1 tutorial. If you have not yet installed those, please revisit that tutorial for directions. In addition, this tutorial will also make use of the rgeos library, installation of which is discussed in part0_setup.


1. RGEOS Overview

The over command will cover a remarkable number of situations, but it is really only interesting if two shapes intersect exactly. In practice, however, we are often interested in sligthly more flexible questions: how many cities are within 10km of a drone strike? Or how many people live close to a government project?

When we want to do fancer geometric operations, we use the rgeos library. rgeos (which stands for “R interface to the Geometry Engine - Open Source library”) is a set of tools for geometric operations. GEOS a huge library, and one that underlies many spatial tools (not just in R). Whenever you’re thinking about a geometric operation, rgeos is the first place to look.

rgeos tools can be broadly divided into three camps. Here some of the most commonly used ones.

Calculating Properties

  • gArea: calculate area of a shape
  • gLength: calculates length of line / circumference of polygons
  • gDistance: distance between items

Making New Shapes

  • gBuffer: Expand points into circles of given radius
  • gCentroid: Collapse polygons to their centroids
  • gDifference, gUnion, and gIntersection: execute set operations on polygons
  • gUnionCascaded: dissolves a collection of shapes into a single shape.
  • gSimplify: If your polyon is too high a resolution (higher than needed for analysis, and too high for fast processing), reduces the number of vertices while maintaining shape to the best of it’s ability.

Testing Geometric Relationships

  • gIntersects: test if shapes intersect. Primarily useful for testing whether two polygons intersect, since this is not somethingover can do.
  • gContains: Is one spatial object entirely within another?
  • gIsValid: Very useful – make sure your geometries aren’t corrupt!

2. Key RGEOS Concepts

The syntax for these tools varies both across tools and depending on the input types, there are three concepts that come up constantly.

2.1: Units

rgeos isn’t a spatial library – it just does geometry. It takes x-y coordinates and applies geometric formula. Thus the units will always be the units of the x-y coordinates, which come from your projection. So here, we can check the projection to find our units:

proj4string(districts)
[1] "+proj=utm +zone=10 +datum=WGS84 +units=m +no_defs +ellps=WGS84 +towgs84=0,0,0"

The units (+units=m) are meters.

2.2: byid

byid is an option on most rgeos commands. If byid=TRUE, each observation in a Spatial* object is handled separately; if byid=FALSE, a Spatial* object is treated as one big geometry. So if one were working with a SpatialPolygons object of US states, when byid=TRUE, the analysis would be conducted for each state; if byid=FALSE, it would essentially run the analysis against the United States as a whole.

Intuitively, the distinction can often be thought of as the difference between asking whether about a spatial relationship between each observation in the Spatial* object on which the tool is being executed, or any observation in the Spatial* polygon.

To see this illustrated, let’s consider our data we used before on the location of government grants. Suppose we want to identify regions that are within 7km of government grants. We could either ask for the area within 7km of each government project (which would yield a different answer for each grant), or the area within 7km of any government grant (which would yield one large area). You can see this distinction in the following plots:

buffered.grants.byidTRUE <- gBuffer(grants.newproj, width = 7000, byid = TRUE)
buffered.grants.byidFALSE <- gBuffer(grants.newproj, width = 7000, byid = FALSE)

par(mfcol = c(1, 2))
plot(districts, main = "byid TRUE")
plot(buffered.grants.byidTRUE, col = "blue", add = T)

plot(districts, main = "byid FALSE")
plot(buffered.grants.byidFALSE, col = "blue", add = T)

As seen in the figures, when byid=TRUE, a separate buffer polygon is created around each grant, creating a large number of (overlapping) polygons. When byid=FALSE, there are no discrete polygons, just a single feature that covers all points within 7km of any government grant. Indeed, if we look at the summary of these buffers, we see this as well – in the first case, you see that under features, the buffers created with byid=TRUE have 6 distinct features (observations), while the buffers created with byid=FALSE have only 1.

buffered.grants.byidTRUE
class       : SpatialPolygonsDataFrame 
features    : 6 
extent      : 546582.3, 581813.7, 4134618, 4189618  (xmin, xmax, ymin, ymax)
coord. ref. : +proj=utm +zone=10 +datum=WGS84 +units=m +no_defs +ellps=WGS84 +towgs84=0,0,0 
variables   : 3
names       : GrantBudge, JobsCreate,                   ProjectNam 
min values  :     210000,          0, Air Traffic Control Retrofit 
max values  :  120000000,         32,         Wetland Preservation 
buffered.grants.byidFALSE
class       : SpatialPolygons 
features    : 1 
extent      : 546582.3, 581813.7, 4134618, 4189618  (xmin, xmax, ymin, ymax)
coord. ref. : +proj=utm +zone=10 +datum=WGS84 +units=m +no_defs +ellps=WGS84 +towgs84=0,0,0 

2.3: id

The output of almost all rgeos commands will be organized by id. For example, with gArea, each area has a label: 0, 1, and 2. When working with Spatial*DataFrame objects, id will be the rowname for a given observation, just like with tools we’ve worked with before.

To see an object’s ids, you can use the combination of the names command and the geometry command (if you don’t use the geometry command you’ll get column names for objects with associated DataFrames:

names(geometry(districts))
[1] "0" "1" "2"

Things get a little more complicated when rgeos creates new polygons using tools like gBuffer or gIntersect. rgeos tends to try and do smart things, but the behavior does vary across tools, so always be careful.

Here’s an example – let’s create buffers (polygons of fixed radius) around our grants the buffers 7km, so we get back polygons for all points within 7km of each project.

proj4string(grants.newproj)  # check units!
[1] "+proj=utm +zone=10 +datum=WGS84 +units=m +no_defs +ellps=WGS84 +towgs84=0,0,0"
buffered.grants <- gBuffer(grants.newproj, width = 7000, byid = TRUE)
names(geometry(buffered.grants))
[1] "1" "2" "3" "4" "5" "6"
plot(districts)
plot(buffered.grants, col = "blue", add = T)

Note that points has given rise to a new polygon, many of them overlapping, and they were given simple names based on the names of the points that generated them. However, if we want a polygon of points within 7km of any grant, we can pass byid=FALSE. Now we have one polygon (named buffer) instead of 5.

buffered.grants <- gBuffer(grants.newproj, width = 7000, byid = FALSE)
names(geometry(buffered.grants))
[1] "buffer"
plot(districts)
plot(buffered.grants, col = "blue", add = T)


Exercise 1

Answers below, but no cheating!

Let’s try and figure out what percentage of residents of Pakistans Federally Administered Tribal Areas (FATA, the regions with the most drone strikes) live within 7km of a drone strike.

  1. Subset the pk.dist data for the Province of Fata.

  2. Create a buffer of 7km with the gBuffer command to help us estimate the total area that is within 7km of any drone strike (what value of byid should you use?).

  3. Plot your buffers over the districts so you can see them.


3. Set operations

rgeos can also do set operations on polygons, like returning areas in which different polygons coincide, don’t coincide.

This can be useful for lots of things. For example, the figure of grant buffers above looks a little odd because the buffered polygons spread into the ocean. We can fix this by “clipping” the buffered polygons using gIntersection – essentially keeping only the part of the buffer that also coincides with districts.

intersection <- gIntersection(buffered.grants, districts, byid = TRUE)
plot(districts)
plot(intersection, col = "blue", add = T)

Interestingly, note that while buffered.grants was previously one feature, because we passed byid=TRUE, the intersection with the districts executed the intersection for each congressional district, creating three distinct polygons.

intersection
class       : SpatialPolygons 
features    : 3 
extent      : 546582.3, 581813.7, 4134618, 4187377  (xmin, xmax, ymin, ymax)
coord. ref. : +proj=utm +zone=10 +datum=WGS84 +units=m +no_defs +ellps=WGS84 +towgs84=0,0,0 

3.1 Managing Output from Set Operations

One challenge with set operations is that the number of features generated by these functions has no mechanical relationship to the number of features that were passed to the set function as inputs. As a result, we can’t just cbind results back into the original data frame as we’ve been able to do in past tutorials because outputs may be of entirely different lengths than the inputs!

rgeos tries to address this problem by generating new ids for the output of set operations that are somewhat intuitive. In general, new ids are concatenations of the ids of the objects that gave rise to a new observation. For example, if you ran gIntersection and one output polygon was created from the overlap of a polygon with the id of 8 from the first Spatial* object passed and a polygon with the id z from the second Spatial* object, that new polygon would have the id: 8 z.

We can see this behavior in the ids of polygons generated from the intersection we just executed between buffered.grants and districts:

names(geometry(buffered.grants))
[1] "buffer"
names(geometry(districts))
[1] "0" "1" "2"
names(geometry(intersection))
[1] "buffer 0" "buffer 1" "buffer 2"

The challenge with this is that these concatenated id variables aren’t always easy to merge with the DataFrames of your original Spatial* objects. For example, say we wanted to measure the share of each electoral district that is within 7km of a government grant. To do so, we would want to divide the area within 7km of each grant in each district by the total area of each district. The problem is that if we measure the area within 7km of each grant – the area of the shapes in the intersection object – we get back a list organized by these concatenated ids.

gArea(intersection, byid = TRUE)
 buffer 0  buffer 1  buffer 2 
178460991  63473835 250339443 

The following code snippet will convert lists like this into a DataFrame, and will split these concatenated id variables into two columns – one for each component of the id – to make merging easier. This is something you chould be able to copy-and-paste after almost any set operation.

# Run calculation
result <- gArea(intersection, byid = TRUE)

# Convert from list to DataFrame and name result
result <- as.data.frame(result)
colnames(result) <- c("area_near_grants")

# ids are stored as row-names, so make into a column.
ids <- as.character(rownames(result))

# Split the id into two columns based on the space in the middle.  Note the
# one place this could break is if the original names had spaces...
new.id.columns <- t(as.data.frame(strsplit(ids, " ")))
colnames(new.id.columns) <- c("id1", "id2")

# Now your results are a data.frame with separate columns for each id
# component!
result <- cbind(result, new.id.columns)
result
         area_near_grants    id1 id2
buffer 0        178460991 buffer   0
buffer 1         63473835 buffer   1
buffer 2        250339443 buffer   2

Now you can use this reformated data to merge with the original data on districts:

# Move the rownames (which form the ids) into a column for merging
districts$id2 <- rownames(as.data.frame(districts))
districts <- merge(districts, result, by = "id2", type = "left")
as.data.frame(districts)
  id2 DISTRICT          NAME Shape_Leng Shape_Area area_near_grants    id1
1   0       18     Palo Alto  308447.51 1815914163        178460991 buffer
2   1       12 San Francisco   70727.76  105358679         63473835 buffer
3   2       14     Peninsula  200487.79  706830475        250339443 buffer

Now we can calculate the share of each district within 7km!

districts$area_near_grants/districts$Shape_Area
[1] 0.09827612 0.60245474 0.35417183

3.2 byid Issues

One thing to be aware of is that rgeos sometimes struggles when byid is set to FALSE. For example, if we want to clip our buffered polygon by the extent of all electoral districts – but not add cuts where the polygon crosses the lines between districts – one might try the following code:

The following code:

gIntersection(buffered.grants, districts, byid = FALSE)

But oddly, this generates the error: Error in RGEOSBinTopoFunc(spgeom1, spgeom2, byid, id, drop_lower_td, "rgeos_intersection") : TopologyException: no outgoing dirEdge found at 570113.57942913717 4148852.3204373871

This kind of problem will come up from time to time – my general solution is to first dissolve the layer we want to consider as one polygon into a single polygon using gUnionCascaded, then execute the second command:

merged.districts <- gUnionCascaded(districts)
plot(merged.districts)

intersection2 <- gIntersection(buffered.grants, merged.districts, byid = FALSE)
intersection2
class       : SpatialPolygons 
features    : 1 
extent      : 546582.3, 581813.7, 4134618, 4187377  (xmin, xmax, ymin, ymax)
coord. ref. : +proj=utm +zone=10 +datum=WGS84 +units=m +no_defs +ellps=WGS84 +towgs84=0,0,0 
plot(districts)
plot(intersection2, col = "blue", add = T)

Now the intersection is only one part!

4. Last Word on rgeos

rgeos is a very large and powerful library, and as a result you may be feeling a little overwhelmed at the moment. This is normal – because rgeos has so many tools and each one behaves somewhat differently, it is not possible to provide comprehensive training for everything in a single tutorial. With that in mind, the aim of this tutorial has been to give you a sense of what (a) what the rgeos library has to offer so you know where to turn when you run into a problem, and (b) to highlight aspects of rgeos that are common to most tools so you’re prepared to figure out new tools within the least difficulty possible. But if you don’t feel quite as comfortable as you have after other tutorials, that’s prefectly normal.


Exercise 2.

Answers below, but no cheating!

Previously we created polygons for areas within 7km of a drone strike. Now let’s try and estimate the share of each district in FATA that is within 7km of a drone strike. (Note: districts are the administrative level below provinces. There are 13 districts within the province of FATA.)

  1. Intersect the buffers with Districts using gIntersection.

  2. Measure the share of each district within 7km of any drone strike using the gArea command for each district, and for the buffered strikes in each district. When you merge the areas of your intersections back in with districts, be sure to look at that table – why are there missing values? Do you need to worry about them?

  3. Run this analysis just for strikes in 2012. If you’ve used ArcGIS before, how would you re-run this analysis in ArcGIS? Would it be easier or harder?


Answers to Exercise 1

pk.dist <- readOGR(dsn = "RGIS2_data/shapefiles", layer = "pk_districts")
strikes.wrongproj <- readOGR(dsn = "RGIS2_data/shapefiles", layer = "pk_drone_strikes")
dist.crs <- CRS(proj4string(pk.dist))
strikes <- spTransform(strikes.wrongproj, dist.crs)

# Subset to Fata
fata <- pk.dist[pk.dist$PROVINCE == "Fata", ]
fata$OBJECTID <- NULL  # Drop column called objectid -- came from shapefile. Just confuses things. 

# Buffer strikes
proj4string(strikes)  # Check projection units
strike.buffer <- gBuffer(strikes, width = 7000, byid = FALSE)
strike.buffer.inter <- gIntersection(strike.buffer, fata, byid = TRUE)

Answers to Exercise 2

####### Merge intersection areas with original areas

# Now to merge back in with Districts based on row numbers
result <- gArea(strike.buffer.inter, byid = TRUE)

# Convert from list to DataFrame
result <- as.data.frame(result)
colnames(result) <- c("area_near_strikes")

# ids are stored as row-names, so make into a column.
ids <- as.character(rownames(result))

# Split the id into two columns based on the space in the middle.  Note the
# one place this could break is if the original names had spaces...
new.id.columns <- t(as.data.frame(strsplit(ids, " ")))
colnames(new.id.columns) <- c("id1", "id2")

# Now your results are a data.frame with separate columns for each id
# component!
result <- cbind(result, new.id.columns)

fata$id2 <- rownames(as.data.frame(fata))

fata <- merge(fata, result, by = "id2", type = "left")

####### Calculate share of area within strike buffer

# Note first that in places where no part of a district was within 7km of a
# strike, there were no `intersection` polygons. Thus when we merge our
# data, we get missing for area near strikes. These need to be zeros.

fata[is.na(fata$area_near_strikes), "area_near_strikes"] <- 0
fata$share_covered <- fata$area_near_strikes/fata$Shape_Area
fata$share_covered

# Only for 2012: add: strikes <- strikes[strikes$year == 2012,] To start of
# code and just rerun!

Creative Commons License
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.