View as a slideshow.
So far we’ve limited our analyses to variables contained within a given table. In real projects, however, you frequently need to join information contained in multiple source tables. Let’s look at the tools dplyr
provides to make this easy. As always a picture is worth a thousand words and the Data Transformation cheat sheet is a great reference (see the Combine Tables section).
Going back to the GISS Surface Temperature data we looked at in the last class, you’ll see that individual tables are available for Norther and Southern hemisphere observations.
Use what we learned last class to download these two files (Shell -> wget):
For my code below, I’ve saved them to a data
subfolder of my current working directory and named them northern-mean.csv
and southern-mean.csv
, respectively.
Take a look at the files in the R Studio editor and clean up as necessary. You’ll notice that these are slightly less aweful than the file I pointed you to last class; mea culpa, I wanted to show you how bad it could get!
For my code below, clean versions of these files are named northern-mean-clean.csv
and southern-mean-clean.csv
, respectively.
These files are in comma-separated format. You could use read.table
to load them and set the separator to a ","
or use the shortcut function named read.csv
:
northern_raw <- read.csv("data/northern-mean-clean.csv", na.strings = "***")
southern_raw <- read.csv("data/southern-mean-clean.csv", na.strings = "***")
As we did in the last class (1) we want to tidy up these tables, (2) clean up column names and (3) make a numeric month column. Since we want to do all of those things to both tables, let’s make a little function!
library(tidyverse)
tidyTemps <- function(table) {
# gather
tidy <- gather(table[1:13], key = "month", value = "index", Jan:Dec)
# lower case all variable names
colnames(tidy) <- tolower(colnames(tidy))
# make numeric month column
month_n <- 1:12
names(month_n) <- colnames(table[2:13])
tidy$month_n <- month_n[tidy$month]
# Functions return their last value; return the tidy table
tidy
}
Now we can use our function to tidy and clean up each table:
northern <- tidyTemps(northern_raw)
southern <- tidyTemps(southern_raw)
Let’s say we’d like to look at a correlation between temperatures in the two hemispheres. To do that we need to get all of the data into the same table. We can do that with one of the join
functions from dplyr
. In this case, our tables both have the same number of rows, so it really doesn’t matter which we choose. I’ll use left_join
here.
temps <- left_join(northern, southern, by = c("year", "month_n"))
Take a look at the result and make sure you understood what happened there. Let’s do some clean up:
temps$month.x <- NULL
temps$month.y <- NULL
colnames(temps)[2] <- "northern"
colnames(temps)[4] <- "southern"
Is this table Tidy? Well, yes and no! If we want to look at a correlation between northern and southern hemisphere temperatures, it is. Make that plot now!
What if we want to look at plots of Northern and Southern temperatures as a function of time? Do the tidying that you would need to and make that plot!
Then try:
geom_smooth
to add a kernal trend line (which is this a nice approach given these data)lm