When working with spatial data, one is rarely interested in working with only one source of data. This tutorial will introduce a set of tools for linking vector data with other data sources. It begins by introducing how to link spatial vector data with non-spatial data in table format, then turns to the problem of linking multiple sources of spatial data through spatial joins and intersects.

This tutorial uses the sp, rgdal, and raster libraries from the RGIS1 tutorial. If you have not yet installed those, please revisit that tutorial for directions. In addition, this tutorial will also make use of the rgeos library, installation of which is discussed in part0_setup.


1.Spatial* + Non-Spatial

An attribute join combines tabular data with a Spatial* object by associating each observation in a table with a GIS object (a polygon, line, or point). If you have done attribute joins of shapefiles in GIS software like ArcGIS or QGis, or merged two datasets in Stata or R, this process is analogous – in an Attribute Join, a Spatial*Dataframe (be that a SpatialPolygonsDataFrame, SpatialPointsDataFrame, or SpatialLinesDataFrame) is merged with a table (an R data.frame) using a common unique identifier.

Assume we have:

where:

Then we can just merge worldCountries with countryData on those variables, using the merge method in sp:

In this case, that means our merge would look like the following:

worldCountries <- merge(worldCountries, countryData, by.x = "id-number", by.y = "countryID")

Important Note: Always pass the merge command your Spatial*DataFrame object. Never merge the data.frame associated with your Spatial*DataFrame object directly, as in the following snippet:

# DON'T DO THIS!
worldCountries@data <- merge(worldCountries@data, countryData, by.x = "id-number", by.y = "countryID")

If merge is passed two data.frames instead of a Spatial*DataFrame and a data.frame, it may jumble the order of rows in the output data, corrupting your data.

That’s it!


Exercise 1

  1. Download and unzip the RGIS2_Data folder.

  2. Load the CSV table district_vote_shares.csv from the data loaded above into a dataframe in R and name it vote_shares.

  3. Load the shapefile congressional_districts from the folder shapefiles and call it districts.

  4. Check out the column names of vote_shares and of districts to determine which one might contain the unique identifier for the join. Hint: use the names() command.

  5. Join the vote_shares data frame with districts using merge as described above. Use the names() command to see if the join was successful.

  6. Now we could plot one of the variables we just joined - try democratic vote share!

spplot(districts, "dem_vote_share")

2. Spatial* + Spatial*

Combining different Spatial* data sources is one of the most common things you will do in GIS research. Consider the following examples:

Most of the time, you answer these questions using some form of Spatial Join. Unlike the “Attribute Joins” described above, a spatial join in your first purely spatial operation. In a spatial join, observations from different datasets are joined based not on a variable, but by their relationship in space.

2.1 Managing Coordinate Reference Systems (CRS)

To combine two Spatial* datasets, the first thing you have to do is make sure they have the same CRS. If you try and work with two Spatial* objects in R that are not in the same CRS, you will get results, but those results will be nonsense! Note that this is very different from programs like ArcGIS that will take care of this problem for you! So before we can join Spatial* objects, we need to learn how to re-project different data sources into a common CRS.

Make sure you remember the differences between defining a CRS and re-projecting:

  • Defining a projection is when the user tells the computer how it should interpret the x-y coordinates associated with a Spatial* object. For example, if data comes from a GPS device, one must tell the computer those x-y coordinates are longitudes and latitudes. It does not change the values of the numbers, just how the computer interprets them.
  • Re-projecting is when you tell the computer to convert coordinates from one representation (like longitude and latitude) to another (like meters from some fixed reference point). It changes not just the proj4string associated with an object, but also all the actual x and y coordinates.

Re-projecting vector data requires two tools from the sp and rgdal packages:

  • a Coordinate Reference System CRS object with the new CRS you wish to apply
  • the spTransform() method

As previously noted, a CRS object includes all the information needed to project a spatial object, generally including both a Geographic Coordinate System (the model of the Earth used to create the data) and a projection (a way of converting points on the three-dimensional Earth onto a two-dimensional plane).

Through package rgdal, the CRS() function has access to a large library of coordinate systems and transformations, so you just need to know the code for the CRS you want. Codes – often called a “projection strings” – can be found online here.

Once you’ve found the projection you want, you can create the appropriate CRS object using one of two codes – the proj4 string, or the EPSG code. These are equivalent, but look a little different.

For example, you can create a CRS object for UTM zone 33N (EPSG:32633) by either passing the full proj4 code:

MyNewProjection <- CRS("+proj=utm +zone=33 +ellps=WGS84 +datum=WGS84 +units=m +no_defs")

or the EPSG code:

MyNewProjection <- CRS("+init=EPSG:32633")

Once you’ve defined a CRS object with the new CRS you want to use, all you have to do is execute the spTransform function on the object you want to re-project. If, for example, we have an object called MyCity and we want to reproject this into a new CRS called MyNewCRS, we would type:

MyNewCRS <- CRS("definition of projection goes here as string")
MyCity.reprojected <- spTransform(MyCity, MyNewCRS)

Note that you can also retrieve the CRS from an existing Spatial* object with the proj4string() command! So if you have two files – file.a and file.b – a common idiom for reprojecting file.b into the CRS of file.a is:

common.crs <- CRS(proj4string(file.a))
file.b.reprojected <- spTransform(file.b, common.crs)

Exercise 2

  1. If you haven’t already, create a directory R_Workshop on your Desktop and unzip RGIS2_Data into that folder.

  2. Load the rgdal, and sp packages.

  3. If you haven’t already, read in the congressional_districts shapefile into R and name it districts.

  4. Read the federal_grants shapefile into R and name it grants.

  5. What is the CRS of districts? What is the CRS of grants?

  6. Reproject grants so it matches the projection of districts and assign it to a new object called grants.newproj.

  7. You can use the range() command from the R base package to compare the coordinates before and after reprojection and confirm that you actually have transformed them. range() simply returns the min and max value of a vector of numbers that you give it. So you can check with:
    range(coordinates(grants))
    and
    range(coordinates(grants.newproj))

  8. You can also compare them visually with:

par(mfrow = c(1, 2))
plot(grants, axes = TRUE)
plot(grants.newproj, axes = TRUE)

grants

2.2. Spatial Joins: over Command

They primary tool for doing spatial joins is the over command from the sp library.

The exact behavior of over depends on the inputs being used, but the basic idea is: “For each item of first position (the SOURCE), over returns information about items of second argument (TARGET) that intersect”.

For example, if for every grant (SOURCE) I wanted to get information about their district (TARGET), I could run:

library(maptools)
grants.districts <- over(grants.newproj, districts)  # Get district data
grants.districts
  DISTRICT          NAME Shape_Leng Shape_Area
1       12 San Francisco   70727.76  105358679
2       14     Peninsula  200487.79  706830475
3       14     Peninsula  200487.79  706830475
4       14     Peninsula  200487.79  706830475
5       18     Palo Alto  308447.51 1815914163
6       18     Palo Alto  308447.51 1815914163
# Recombine!
grants.newproj <- spCbind(grants.newproj, grants.districts)
grants.newproj
class       : SpatialPointsDataFrame 
features    : 6 
extent      : 553582.3, 574813.7, 4141618, 4182618  (xmin, xmax, ymin, ymax)
coord. ref. : +proj=utm +zone=10 +datum=WGS84 +units=m +no_defs +ellps=WGS84 +towgs84=0,0,0 
variables   : 7
names       : GrantBudge, JobsCreate,                   ProjectNam, DISTRICT,          NAME, Shape_Leng, Shape_Area 
min values  :     210000,          0, Air Traffic Control Retrofit,       12,     Palo Alto,   70727.76,  105358679 
max values  :  120000000,         32,         Wetland Preservation,       18, San Francisco,  308447.51, 1815914163 

Note the use of the spCbind command from the maptools library for merging the datasets – this can also be accomplished without the maptools library by simple assignment of each vector of data.

Note that because districts is a SpatialPolygonDataFrame, over returned the relevant row of data for each point. If districts did not have any data, we would just get the index of the intersecting polygon for each grant. We can see this behavior if we use the geometry() function to strip away the DataFrame:

grants.with.districts2 <- over(grants.newproj, geometry(districts))
head(grants.with.districts2)
1 2 3 4 5 6 
2 3 3 3 1 1 

A few caveats: * By default, over will return the first item in the TARGET that intersects with an item in SOURCE if there are multiple items. * over can only handle intersection of two SpatialPolygons objects after package rgeos has been loaded. * more details are found in this document

Multiple Intersections

By default, when there are multiple TARGET observations that intersect a SOURCE observation, over will just return the first one. There are two ways to address this.

Multiple Intersections Option 1: returnList = TRUE

over normally returns a vector, which can only have one item per row. However, if you pass the argument returnList = TRUE, over will return a named list, where: * The name for each list entry is the index of the SOURCE observation * The contents are the indices of all items in the TARGET that intersect the SOURCE observation

Once you have this list, you can compute on it with tools like sapply.

Here’s an example with just indices of intersecting TARGET observations:

over.list <- over(districts, geometry(grants.newproj), returnList = TRUE)
over.list
$`0`
[1] 5 6

$`1`
[1] 1

$`2`
[1] 2 3 4

Then you can process these down (say, to get the number of each) and merge back in:

library(maptools)
num.grants <- sapply(over.list, length)
districts <- spCbind(districts, num.grants)
districts
class       : SpatialPolygonsDataFrame 
features    : 3 
extent      : 498739.1, 610118, 4089593, 4191546  (xmin, xmax, ymin, ymax)
coord. ref. : +proj=utm +zone=10 +datum=WGS84 +units=m +no_defs +ellps=WGS84 +towgs84=0,0,0 
variables   : 5
names       : DISTRICT,          NAME, Shape_Leng, Shape_Area, num.grants 
min values  :       12,     Palo Alto,   70727.76,  105358679,          1 
max values  :       18, San Francisco,  308447.51, 1815914163,          3 

Here’s an example where one gets back the relevant data.frame rows of intersecting TARGET observations.

over(districts, grants.newproj, returnList = TRUE)
$`0`
  GrantBudge JobsCreate              ProjectNam DISTRICT      NAME
5     210000          0 NSF Graduate Fellowship       18 Palo Alto
6  120000000         12    DARPA Robotics Grant       18 Palo Alto
  Shape_Leng Shape_Area
5   308447.5 1815914163
6   308447.5 1815914163

$`1`
  GrantBudge JobsCreate          ProjectNam DISTRICT          NAME
1    7000000          5 Emergency Starbucks       12 San Francisco
  Shape_Leng Shape_Area
1   70727.76  105358679

$`2`
  GrantBudge JobsCreate                   ProjectNam DISTRICT      NAME
2   10000000         22               Runway Repairs       14 Peninsula
3     750000         32         Wetland Preservation       14 Peninsula
4     270000          3 Air Traffic Control Retrofit       14 Peninsula
  Shape_Leng Shape_Area
2   200487.8  706830475
3   200487.8  706830475
4   200487.8  706830475

Multiple Intersections Option 2: fn or aggregate

However, if the second argument has a data.frame associated with it, over can be instructed to aggregate the variables associated with intersecting TARGET observations.

For example, we can use this to get the average value of grants in each district:

over(districts, grants.newproj[, "GrantBudge"], fn = mean)
  GrantBudge
0   60105000
1    7000000
2    3673333

Since we are now aggregating GrantBudge values in grants.newproj by districts, we can also use the aggregate method in sp for spatial aggregation, which in addition to the over command above returns a Spatial* object:

aggregate(grants.newproj[, "GrantBudge"], districts, mean)
class       : SpatialPolygonsDataFrame 
features    : 3 
extent      : 498739.1, 610118, 4089593, 4191546  (xmin, xmax, ymin, ymax)
coord. ref. : +proj=utm +zone=10 +datum=WGS84 +units=m +no_defs +ellps=WGS84 +towgs84=0,0,0 
variables   : 1
names       :       GrantBudge 
min values  : 3673333.33333333 
max values  :         60105000 

If we had other information on food trucks – like their value – we could also ask for things like their mean value by passing the mean function.

Note that over selects anything that intersects, including polygons that only touch each other. When working with two polygon sets, it makes often more sense to apply area weighting, where weights are proportional to the amount of overlap. This is obtained by

aggregate(polygons_A, polygons_B, mean, areaWeighted = TRUE)

Exercise 2

Answer code at bottom of exercise, but no cheating!

Now that we’ve covered the basics of GIS in R, we can start working with real data. In the following exercises, we will work with data on US drone strikes in Pakistan collected by the Bureau of Investigative Journalism. Save your code from this exercise – we’ll use these shapefiles more.

  1. Load the pk_districts shapefile from the shapefiles folder and save as pk_dist.

  2. Load the pk_drone_strikes shapefile from the shapefiles folder and save as strikes.

  3. Plot the two shapefiles together using the plot command. The result should look something like the following. If it does not, consider whether you skipped an IMPORTANT STEP to combining different Spatial* objects.

test

  1. Look at the pk_drone_strikes data. What information is included?

  2. Calculate the number of drone strikes per district in Pakistan. (Think: what is your SOURCE and what is your TARGET for this operation? Is it 1-1 or Many to 1) Where are drone strikes concentrated?

  3. What is the average fatality rate per district?

3. Rasters + SpatialPolygons

Finally, sometimes we have a SpatialPolygons, and want to know something about the properties within the polygons based on information in a raster dataset. For example, we might have the polygons of electoral districts in Africa, and want to know the average level of light as seen by satellites.

In these situations, we use the extract tool, which returns the values of all the raster cells that fall within each polygon. (More specifically, it returns a list of numeric vectors, where the vector in position N corresponds to the raster cell values inside polygon N.) You can then compute on these values as you wish!

3.1 Reprojecting Raster data

As with Spatial* objects, the first step is to make sure everything has a common projection.

Raster objects can be reprojected using the projectRaster function. However, be aware that reprojecting rasters is not quite as costless as reprojecting vector data. Rasters must also have a regular grid pattern, and since different projections will not necessarily preserve that feature, reprojecting rasters means creating a new grid and computing each value based on overlapping cells from the old grid. Thus it is computationally difficult and can lead to losses of precision. Thus you are probably better off reprojecting your SpatialPolygon object rather than your raster.

3.2 The Extract Tool

pollution <- raster("RGIS2_Data/pollution.tif")
raster.crs <- CRS(projection(pollution))
districts.reprojected <- spTransform(districts, raster.crs)
extracted.values <- extract(pollution, districts.reprojected)

# A few example outputs -- low index polygons are in south SF and don't
# actually have values, so jumping to top.
extracted.values
[[1]]
 [1] 6.8 8.0 6.6 6.6 5.9 5.7 5.9 5.6 5.5 6.0 5.9 6.2 5.5 5.4 4.7 6.4 6.0

[[2]]
    
5.4 

[[3]]
[1] 6.1 6.2 5.9 9.5 6.1 5.9

You can get average values using the sapply() function:

# A subsets of outputs
sapply(extracted.values, mean)
[1] 6.041176 5.400000 6.616667

Exercise 3

Answers below, but no cheating!

  1. Load the population raster from the RGIS2_Data folder and call it pk.pop with something like the following command (depending on your working directory): pk.pop <- raster("pakp00g"). This file is a raster where the value of each cell is an estimate of the number of people living in that area. (Note that you are pointing R at a folder with the raster components – raster, unlike rgdal, is smart enough to figure out what you’re asking!)

  2. Use the extract command to estimate the population of each Pakistani district.

  3. Compute the number of drone-strikes per capita for each district. .


Solutions

Answers to Exercise 2

pk.dist <- readOGR(dsn = "RGIS2_data/shapefiles", layer = "pk_districts")
strikes <- readOGR(dsn = "RGIS2_data/shapefiles", layer = "pk_drone_strikes")

# Re-project
dist.crs <- CRS(proj4string(pk.dist))
strikes.projected <- spTransform(strikes, dist.crs)


# Plot
plot(pk.dist)
par(new = T)
plot(strikes.projected, type = ".", col = "blue", add = T)
par(new = F)


# Look at data
head(strikes.projected)

# Number per district
over.list <- over(pk.dist, geometry(strikes.projected), returnList = TRUE)
num.strikes <- sapply(over.list, length)
pk.dist <- spCbind(pk.dist, num.strikes)
pk.dist[pk.dist$num.strikes != 0, ]

# Avg Fatality
avg.fatality <- over(pk.dist, strikes.projected[c("Killed", "CiviliansK")], 
    fn = mean)
pk.dist <- spCbind(pk.dist, avg.fatality)
pk.dist[!is.na(pk.dist$Killed), ]

Answers to Exercise 3

pk.pop <- raster("RGIS2_Data/pakp00g")
new.crs <- CRS(projection(pk.pop))
pk.dist.rasterproj <- spTransform(pk.dist, new.crs)

pops <- extract(pk.pop, pk.dist.rasterproj)
pops.summed <- sapply(pops, sum)
pk.dist.rasterproj <- spCbind(pk.dist.rasterproj, pops.summed)

pk.dist.rasterproj$strikes.percap <- pk.dist.rasterproj$num.strikes/pk.dist.rasterproj$pops.summed

head(pk.dist.rasterproj[pk.dist.rasterproj$num.strikes > 0, ])

Creative Commons License
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.