JOM299

Week 2: air pollution on Euston Road

Introduction to ggplot

ggplot is going to be our best friend for this module

Great link to bookmark: ggplot cheatsheet

Importing data

“Importing” data means loading a file from your computer into your programming environment, then storing it in a variable to make it available to us.

Where is our data?

- London Air
- Camden Open Data

CSV files

Our preferred data format. CSV is like an Excel spreadsheet, but just plain text:

name,surname,occupation
basile,simon,journalist
mick,jagger,musician
theresa,may,prime minister

R will recognise the structure above and understand that the commas separate columns and that each line is a row. It will show the data as a table:

name	surname	occupation
basile	simon	journalist
mick	jagger	musician
theresa	may	prime minister

Using CSV in R

We start by loading in the CSV file containing our data:

library(readr)

df <- read_csv("data/airpollutioneuston.csv")
View(df)
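
To try read_csv without the course file, a minimal sketch can write the sample rows from the CSV section to a temporary file and read them back. The temporary file and the names csv_lines, path and people are illustrative only and are not part of the course materials.

```r
library(readr)

# Sample CSV text from the notes, written to a throwaway file (illustrative only)
csv_lines <- c(
  "name,surname,occupation",
  "basile,simon,journalist",
  "mick,jagger,musician",
  "theresa,may,prime minister"
)
path <- tempfile(fileext = ".csv")
writeLines(csv_lines, path)

# read_csv() splits each line on commas and uses the first line as column names
people <- read_csv(path)

names(people)  # column names taken from the header line
nrow(people)   # one row per person
```

The same call works on any path, which is all that changes when we point it at data/airpollutioneuston.csv.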

Loading ggplot

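# Install once per machine (no need to re-run these every session)
install.packages("ggplot2")
install.packages("dplyr")

# Load at the start of every session
library(ggplot2)
library(dplyr)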

WHO guideline: 40 µg/m3 annual mean

The WHO guideline for NO2 used in these notes (the 2005 edition value) is an annual mean of no more than 40 µg/m3.

Did this happen on Euston Road? summary() gives us basic stats for every column of our dataset very quickly; we load dplyr for the %>% pipe:

library(dplyr)

df %>% summary()

Calculating a mean

We can also calculate the mean ourselves with summarise, which works with many other handy functions too (median, min, max, n and so on):

df %>% summarise(annual_mean = mean(Value))

  annual_mean
        <dbl>
1        82.8

# how many observations do we have?
df %>% summarise(observations = n())

  observations
         <int>
1          365
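
As a sketch of how several summaries can be combined in one summarise call, the example below uses four made-up readings rather than the real dataset. The toy tibble, the stats variable and the days_over_40 column are assumptions for illustration only.

```r
library(dplyr)

# Made-up readings in µg/m3, standing in for the Value column
toy <- tibble(Value = c(30, 45, 60, 85))

stats <- toy %>%
  summarise(
    annual_mean  = mean(Value),      # average of all readings
    observations = n(),              # number of rows
    days_over_40 = sum(Value > 40)   # readings above the 40 µg/m3 guideline
  )

stats
```

On the toy data the mean is 55 across 4 observations, 3 of which exceed 40.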

Clean data a bit

One issue with our dataset: the ReadingDateTime column comes out as a string (df %>% summary() reports it as a character column).

We will need to parse that as a date!

Dates in programming

Dates are odd creatures. We parse strings and convert them into dates, but how does the computer know the format of the date?

2018-01-02
2018/02/01

These dates could be identical or different depending on how we parse them.

Date formats to the rescue

Date format specifiers

2018-01-02 parsed with %Y-%m-%d becomes 2nd Jan 2018
2018-01-02 parsed with %Y-%d-%m becomes 1st Feb 2018
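
The two lines above can be checked directly with base R's as.Date. This is a small sketch; the variable names raw, ymd, ydm and mismatch are illustrative.

```r
raw <- "2018-01-02"

# Same string, two format strings, two different dates
ymd <- as.Date(raw, format = "%Y-%m-%d")  # year-month-day: 2 January 2018
ydm <- as.Date(raw, format = "%Y-%d-%m")  # year-day-month: 1 February 2018

ymd
ydm

# A format that does not match the string gives NA rather than an error
mismatch <- as.Date("2018/02/01", format = "%Y-%m-%d")
is.na(mismatch)
```

Because a mismatch fails silently with NA, it is worth checking the parsed column before plotting.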

Cleaning our air pollution data

We’ll use the British day/month/year order in this case:

df <- df %>% mutate(Date = as.Date(ReadingDateTime,
                                   format = "%d/%m/%Y")) %>%
  select(Date, Value)
  
  Date       Value
  <date>     <dbl>
1 2017-01-01  69.9
2 2017-01-02  57.5
3 2017-01-03  91.9
4 2017-01-04  67.9
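
One quick sanity check after parsing, sketched here on made-up strings rather than the real file: as.Date returns NA for anything that does not match the format, so counting NAs flags rows that failed to parse. The raw tibble is hypothetical, including the assumption that a time may follow the date, and its third row is deliberately malformed.

```r
library(dplyr)

# Hypothetical raw rows; the third date is deliberately in the wrong format
raw <- tibble(
  ReadingDateTime = c("01/01/2017 00:00", "02/01/2017 00:00", "2017-01-03"),
  Value           = c(69.9, 57.5, 91.9)
)

parsed <- raw %>%
  mutate(Date = as.Date(ReadingDateTime, format = "%d/%m/%Y"))

# Trailing characters after the matched date are ignored by as.Date;
# strings that do not fit the format at all become NA
n_failed <- sum(is.na(parsed$Date))
n_failed
```

A result of 0 on the real data would mean every row was parsed.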

Basic plot in ggplot

# install.packages("ggplot2")
library(ggplot2)

ggplot(df, aes(x = Date, y = Value)) +
  geom_point()

What just happened?

We just used ggplot2, the leading R visualisation package, to create a scatterplot. ggplot2 implements a "grammar of graphics", i.e. a chart is composed of several bricks:

a dataset,
geometries,
a coordinate system
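
To make the bricks concrete, here is a minimal sketch that spells each one out on a made-up three-row dataset (the toy data frame is hypothetical). coord_cartesian() is what ggplot uses by default; it is written out only to show where the coordinate system sits.

```r
library(ggplot2)

# Hypothetical three-row dataset standing in for df
toy <- data.frame(
  Date  = as.Date(c("2017-01-01", "2017-01-02", "2017-01-03")),
  Value = c(69.9, 57.5, 91.9)
)

p <- ggplot(toy, aes(x = Date, y = Value)) +  # brick 1: dataset, plus column-to-axis mapping
  geom_point() +                              # brick 2: a geometry (points)
  coord_cartesian()                           # brick 3: the coordinate system

p
```

Swapping geom_point() for geom_line() changes the geometry while the dataset and coordinate system stay the same.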

Colours, opacity, scales

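alpha is opacity, from 0 (fully transparent) to 1 (fully opaque)
colours are written as hex codes such as #254251 (further reading: What to consider when choosing colours)
geom_hline is a new geometry that draws a horizontal line! We can also use geom_vline for a vertical line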

ggplot(df, aes(Date, Value)) +
  geom_point(alpha = 0.5, color = "#254251") +  # half-transparent dark blue points
  geom_hline(yintercept = 40) +                 # the 40 µg/m3 guideline
  # labels default to the break values, so breaks alone are enough
  scale_y_continuous(breaks = c(40, 100, 150, 200, 250))
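
A common stumbling block with colours and opacity is the difference between setting a constant and mapping a column. The sketch below, on made-up data, is one way to see it: a constant passed outside aes() is applied literally, while the same string inside aes() is treated as data. The toy data frame and the names p_set and p_map are illustrative.

```r
library(ggplot2)

toy <- data.frame(x = 1:3, y = c(50, 40, 60))  # made-up values

# Set: a constant outside aes() is applied literally to every point
p_set <- ggplot(toy, aes(x, y)) +
  geom_point(color = "steelblue", alpha = 0.5)

# Mapped: inside aes() the string is treated as a one-category variable,
# so ggplot picks a colour from its default palette and adds a legend
p_map <- ggplot(toy, aes(x, y)) +
  geom_point(aes(color = "steelblue"))

# layer_data() shows the colour each point actually ends up with
unique(layer_data(p_set)$colour)
unique(layer_data(p_map)$colour)
```

The rule of thumb: columns go inside aes(), fixed values go outside.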

Gratuitous styles

library(scales)

# Rescale Value to the 0-1 range so it can be used directly as an opacity
df$alpha <- rescale(df$Value, to = c(0, 1))

ggplot(df, aes(Date, Value)) +
  # scale_alpha_identity() tells ggplot to use the alpha column as-is
  geom_point(aes(alpha = alpha), color = "#254251") +
  scale_alpha_identity() +
  geom_hline(yintercept = 40) +
  scale_y_continuous(breaks = c(40, 100, 150, 200, 250))
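
For intuition about what rescale does, here is a small sketch on three made-up values: the smallest value maps to the bottom of the target range, the largest to the top, and everything else is placed proportionally in between.

```r
library(scales)

values <- c(50, 100, 250)  # made-up readings

# (value - min) / (max - min), stretched onto the target range
opacity <- rescale(values, to = c(0, 1))
opacity
```

Note that the lowest reading gets an opacity of 0 and so becomes invisible; a target range such as to = c(0.1, 1) would keep it faintly visible.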

Averages

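We want to calculate a 30-day rolling average. This is super easy in R: we need rollmean, from the zoo package. By default the window is centred on each day, and fill = NA pads the days at either end that do not have a full window.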

Syntax:

rollmean(data$column, period)
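
A tiny worked example, using a made-up vector and a window of 3 instead of 30, shows what rollmean returns. rollmeanr is the right-aligned variant from the same package, included here only for comparison.

```r
library(zoo)

x <- c(1, 2, 3, 4, 5)  # made-up values

# Centred window (the default): each value is averaged with its two neighbours
centred <- rollmean(x, 3, fill = NA)

# Right-aligned window: each value is averaged with the two values before it
trailing <- rollmeanr(x, 3, fill = NA)

centred   # NA  2  3  4 NA
trailing  # NA NA  2  3  4
```

With a centred 30-day window, the first and last couple of weeks of the year come out as NA, which is why the line on the chart starts and ends short of the points.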

#install.packages("zoo")
library(zoo)

df_mean <- df %>%
  mutate(mean = rollmean(Value, 30, fill = NA))  # NA where the window is incomplete

ggplot(df_mean, aes(Date, Value)) +
  geom_hline(yintercept = 40) +
  # use the alpha column of the data frame we are plotting
  geom_point(alpha = df_mean$alpha, color = "#254251") +
  geom_line(aes(y = mean)) +                     # x = Date is inherited
  scale_y_continuous(breaks = c(40, 100, 150, 200, 250))

All together

We can also chain every step with pipes, so that we never have to store intermediate versions of our dataset as we go along, like so:

dataframe %>%
  do something on it %>%
  like filtering, adding columns, etc %>%
  then send it to ggplot like so %>%
  ggplot() +
    add geometries, etc

library(readr)
library(dplyr)
library(zoo)
library(ggplot2)

df <- read_csv("data/airpollutioneuston.csv")

df %>%
  filter(!is.na(Value)) %>%   # drop days with no reading before averaging
  mutate(Date = as.Date(ReadingDateTime, format = "%d/%m/%Y"),
         mean = rollmean(Value, 30, fill = NA)) %>%
  select(Date, Value, mean) %>%
  ggplot(aes(x = Date)) +
  geom_hline(yintercept = 40) +
  # fixed alpha and colour go outside aes(), so no legend is created
  geom_point(aes(y = Value), alpha = 0.5, color = "steelblue") +
  geom_line(aes(y = mean)) +
  scale_y_continuous(breaks = c(40, 100, 150, 200, 250)) +
  ggtitle("Daily NO2 concentration on Euston Road") +
  xlab("Date") + ylab("NO2 concentration (µg/m3)")

Reading list

https://www.ted.com/talks/hans_rosling_shows_the_best_stats_you_ve_ever_seen

http://datadrivenjournalism.net/resources/when_should_i_use_logarithmic_scales_in_my_charts_and_graphs

https://www.datacamp.com/community/blog/the-easiest-way-to-learn-ggplot2#gs.QnUNY8Y
