Data_Visualization_-_Tropical_Storms.Rmd

---
title: "Tropical Storm Data"
author: "Scott Stoltzman"
date: "9/12/2017"
output: html_document
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = FALSE, message = FALSE, warning = FALSE, fig.align = 'center')
```
```{r load_libraries}
library(tidyverse)
library(ggthemes)
library(ggmap)
library(htmlwidgets)
```

# Exploratory Data Analysis of Tropical Storms in R  

The disastrous impact of recent hurricanes, Harvey and Irma, generated a large influx of data within the online community. I was curious as to what the history of hurricanes and tropical storms looked like so I found a data set on [data.world](https://data.world/dhs/historical-tropical-storm) and started some basic Exploratory data analysis (EDA).  


EDA is crucial to starting any project. Through EDA you can start to identify errors & inconsistencies in your data, find interesting patterns, see correlations and start to develop hypotheses to test. For most people, basic spreadsheets and charts are pretty handy and provide a great place to start. They are an easy-to-use method to manipulate and visualize your data quickly. Data scientists may cringe at the idea of using a graphical user interface (GUI) to kick-off the EDA process but the reality is clear, those tools are very effective and efficient when used properly. However, if you're reading this, you're probably trying to take EDA to the next level. The best way to learn is to get your hands dirty, let's get started.  


The original source of the data was can be found at [DHS.gov](https://hifld-dhs-gii.opendata.arcgis.com/datasets/3ea21accbfab4ed8b14ede2e802cc2ec_0).  


----


#### Step 1:  Take a look at your data set and see how it is laid out  


```{r read_data}
# data source https://data.world/dhs/historical-tropical-storm
data = read_csv('data/Historical_Tropical_Storm_Tracks.csv')
knitr::kable(head(data))
```
  
  
Fortunately, this is a tidy data set which will make life easier and appears to be cleaned up substantially. The column names are relatively straightforward with the exception of "ID" columns.

The description as given by [DHS.gov](https://hifld-dhs-gii.opendata.arcgis.com/datasets/3ea21accbfab4ed8b14ede2e802cc2ec_0): 

>This dataset represents Historical North Atlantic and Eastern North Pacific Tropical Cyclone Tracks with 6-hourly (0000, 0600, 1200, 1800 UTC) center locations and intensities for all subtropical depressions and storms, extratropical storms, tropical lows, waves, disturbances, depressions and storms, and all hurricanes, from 1851 through 2008. These data are intended for geographic display and analysis at the national level, and for large regional areas. The data should be displayed and analyzed at scales appropriate for 1:2,000,000-scale data.  


#### Step 2:  View some descriptive statistics  

```{r}
knitr::kable(summary(data %>% select(YEAR, 
                                     MONTH, 
                                     DAY,
                                     WIND_KTS,
                                     PRESSURE)))
```


We can confirm that this particular data had storms from 1851 - 2010, that means the data goes back roughly 100 years before naming storms started! We can also see that the minimum pressure values are 0, which likely means it could not be measured (due to the fact zero pressure is not possible in this case). We can see that there are recorded months from January to December along with days extending from 1 to 31. Whenever you see all of the dates laid out that way, you can smile and think to yourself, "if I need to, I can put dates in an easy to use format such as YYYY-mm-dd (2017-09-12)!"  


#### Step 3: Make a basic plot  


```{r}
df = data %>%
  filter(NAME != 'NOTNAMED' & NAME != 'SUBTROP1') %>%
  group_by(YEAR) %>%
  summarise(Distinct_Storms = n_distinct(NAME))

p = ggplot(df, aes(x = YEAR, y = Distinct_Storms)) + theme_economist()
p + geom_line(size = 1.1) + 
  ggtitle("Number of Storms Per Year") + 
  geom_smooth(method='lm', se = FALSE) + 
  ylab("Storms")
```
  

This is a great illustration of our data set and we can easily notice an upward trend in the number of storms over time. Before we go running to tell the world that the number of storms per year is growing, we need to drill down a bit deeper. This could simply be caused because more types of storms were added to the data set (we know there are hurricanes, tropical storms, waves, etc.) being recorded. However, we should keep it in mind when we start to develop hypotheses.  


**You will notice the data starts at 1950 rather than 1851.** I made this choice because storms were not named until this point so it would be difficult to try and count the unique storms per year. It could likely be done by finding a way to utilize the "ID" columns. However, this is a preliminary analysis so I didn't want to dig too deep.  


#### Step 4: Make some calculations

```{r}
pct.diff = function(x){round((x-lag(x))/lag(x),2)}
act.diff = function(x){round((x-lag(x)),2)}
df = data %>%
  arrange(YEAR) %>%
  filter(NAME != 'NOTNAMED' & NAME != 'SUBTROP1') %>%
  group_by(YEAR) %>%
  summarise(Distinct_Storms = n_distinct(NAME)) %>%
  mutate(Distinct_Storms_Change = act.diff(Distinct_Storms),
         Distinct_Storms_Pct_Change = pct.diff(Distinct_Storms)) %>%
  na.omit() %>%
  arrange(YEAR)
df$YEAR = factor(df$YEAR)
knitr::kable(head(df,10))
```
  
  
In this case, we can see the number of storms, nominal change and percentage change per year. These calculations help to shed light on what the growth rate looks like each year.  So we can use another summary table:  


```{r}
knitr::kable(summary(df %>% select(-YEAR)))
```
  
From the table we can state the following for the given time period:  

  * The mean number of storms is 23 per year (with a minimum of 6 and maximum of 43)
  * The mean change in the number of storms per year is 0.34 (with a minimum of -15 and maximum of 16)
  * The mean percent change in the number of storms per year is 6% (with a minimum of -42% and maximum of 114%)

Again, we have to be careful because these numbers are in aggregate and may not tell the whole story. Dividing these into groups of storms is likely much more meaningful.  


#### Step 5: Make a more interesting plot  


```{r}
df = data %>%
  filter(NAME != 'NOTNAMED' & NAME != 'SUBTROP1') %>%
  filter(grepl("H", CAT)) %>%
  group_by(YEAR,CAT) %>%
  summarise(Distinct_Storms = n_distinct(NAME))
df$CAT = factor(df$CAT)

p = ggplot(df, aes(x = YEAR, y = Distinct_Storms, col = CAT)) + theme_economist()
p + geom_line(size = 1.1) + 
  scale_color_brewer(direction = -1, palette = "Spectral") + 
  ggtitle("Number of Storms Per Year By Category (H)") + 
  facet_wrap(~CAT, scales = "free_x") + 
  geom_smooth(method = 'lm', se = FALSE, col = 'black') +
  theme(axis.text.x = element_text(angle=90), legend.position = 'none') + 
  ylab('Storms')
```
  

Because I was most interested in hurricanes, I filtered out only the data which was classified as "H (1-5)." By utilizing a data visualization technique called "small multiples" I was able to pull out the different types and view them within the same graph. While this is possible to do in tables and spreadsheets, it's much easier to visualize this way. By holding the axes constant, we can see the majority of the storms are classified as H1 and then it appears to consistently drop down toward H5 (with very few actually being classified as H5). We can also see that most have an upward trend from 1950 - 2010. The steepest appears to be H1 (but it also flattens out over the last decade).  


#### Step 5: Make a filtered calculation  

```{r}
df = data %>%
  arrange(YEAR) %>%
  filter(grepl("H", CAT)) %>%
  filter(NAME != 'NOTNAMED' & NAME != 'SUBTROP1') %>%
  group_by(YEAR) %>%
  summarise(Distinct_Storms = n_distinct(NAME)) %>%
  mutate(Distinct_Storms_Change = act.diff(Distinct_Storms),
         Distinct_Storms_Pct_Change = pct.diff(Distinct_Storms)) %>%
  na.omit() %>%
  arrange(YEAR)
knitr::kable(summary(df %>% select(-YEAR)))
```


Now we are looking strictly at hurricane data (classified as H1-H5):  

  * The mean number of hurricanes is 13 per year (with a minimum of 4 and maximum of 24)
  * The mean change in the number of hurricanes per year is 0.05 (with a minimum of -11 and maximum of 10)
  * The mean percent change in the number of hurricanes per year is 8% (with a minimum of -56% and maximum of 180%)  
    
While it doesn't really make sense to say "we got an average growth of 0.05 hurricanes per year between 1950 and 2010" ... it may make sense to say "we saw an average of growth of 8% per year in the number of hurricanes between 1950 and 2010."  

That's a great thing to put in quotes!

> During EDA we discovered an average of growth of 8% per year in the number of hurricanes between 1950 and 2010.  

Be ready, as soon as you make a statement like that, you will likely have to explain how you arrived at that conclusion. That's where having an RMarkdown notebook and data online in a repository will help you out! Reproducible research is all of the hype right now.   
  

#### Step 5: Try visualizing your statements  

```{r}

df = data %>%
  filter(NAME != 'NOTNAMED' & NAME != 'SUBTROP1') %>%
  filter(grepl("H", CAT)) %>%
  group_by(YEAR) %>%
  summarise(Distinct_Storms = n_distinct(NAME)) %>%
  mutate(Distinct_Storms_Pct_Change = pct.diff(Distinct_Storms))

p = ggplot(df,aes(x = Distinct_Storms_Pct_Change)) + theme_economist()

p1 = p + geom_histogram(bins = 20) +
  ggtitle("YoY % Change Density") +
  scale_x_continuous(labels = scales::percent) +
  ylab('') + xlab('YoY % Change in Hurricanes')

p2 = p + geom_density(fill='darkgrey',alpha=0.5) +
  ggtitle("YoY % Change Density") +
  scale_x_continuous(labels = scales::percent) +
  ylab('') + xlab('YoY % Change in Hurricanes')

gridExtra::grid.arrange(p1,p2,ncol=2)
```


A histogram and/or density plot is a great way to visualize the distribution of the data you are making statements about. This plot helps to show that we are looking at a right-skewed distribution with substantial variance. Knowing that we have n = 58 (meaning 58 years after being aggregated), it's not surprising that our histogram looks sparse and our density plot has an unusual shape. At this point, you can make a decision to jot this down, research it in depth and then attack it with full force.  


However, that's not what we're covering in this post.  


#### Step 6: Plot another aspect of your data


```{r}
big_map <- get_googlemap(c(lon=-95, lat=30), zoom = 4, maptype = "terrain")
ggmap(big_map, extent='panel') + 
  geom_point(data = data, mapping = aes(x = LONG, y = LAT),col='red',alpha=0.1)
```


60K pieces of data can get out of hand quickly, we need to back this down into manageable chunks. Building on the knowledge from our last exploration, we should be able to think of a way to cut this down to get some better information. The concept of small multiples could come in handy again! Splitting the data up by type of storm could prove to be invaluable. We can also tell that we are missing

-----

```{r}
df = data %>% filter(grepl("H", CAT))
ggmap(big_map) + 
  geom_density_2d(data = df, mapping = aes(x = LONG, y = LAT), size = 0.5) + 
  stat_density2d(data = df, 
    aes(x = LONG, y = LAT, fill = ..level.., alpha = ..level..), size = 0.1, 
    bins = 20, geom = "polygon") + scale_fill_gradient(low = "green", high = "red", 
    guide = FALSE) + scale_alpha(range = c(0.1, 0.5), guide = FALSE) + 
  facet_wrap(~CAT)
```
  

After filtering the data down to hurricanes and utilizing a heatmap rather than plotting individual points we can get a better handle on what is happening where. The H4 and H5 sections are probably the most interesting. It appears as if H4 storms are more frequent on the West coast of Mexico whereas the H5 are most frequent in the Gulf of Mexico.  


Because we're still in EDA mode, we'll continue with another plot.  


```{r}
df = data %>% filter(!grepl("H", CAT) & !grepl("W", CAT))
ggmap(big_map) + 
  geom_density_2d(data = df, mapping = aes(x = LONG, y = LAT), size = 0.5) + 
  stat_density2d(data = df, 
    aes(x = LONG, y = LAT, fill = ..level.., alpha = ..level..), size = 0.1, 
    bins = 20, geom = "polygon") + scale_fill_gradient(low = "green", high = "red", 
    guide = FALSE) + scale_alpha(range = c(0.1, 0.5), guide = FALSE) + 
  facet_wrap(~CAT)
```


Here are some of the other storms from the data set. We can see that TD, TS and L have large geographical spreads. The E, SS, and SD storms are concentrated further North toward New England.  

Digging into this type of data and building probabalistic models is a fascinating field. The actuarial sciences are extremely difficult and insurance companies really need good models. Having mapped this data, it's pretty clear you could dig in and find out what parts of the country should expect what types of storms (and you've also known this just from being alive for 10+ years). More hypotheses could be formed about location at this stage and could be tested!  


#### Step 7: Look for a relationship


```{r}
df = data %>% 
  filter(PRESSURE > 0) %>%
  filter(grepl("H", CAT)) %>%
  group_by(CAT,YEAR,MONTH,DAY,LAT,LONG) %>%
  summarise(MEAN_WIND_KTS = mean(WIND_KTS), MEAN_PRESSURE = mean(PRESSURE)) %>%
  arrange(MEAN_WIND_KTS)
df$CAT = factor(df$CAT)

p = ggplot(df,aes(x=MEAN_WIND_KTS, y = MEAN_PRESSURE, fill = CAT)) + theme_economist()
p + 
  geom_hex(alpha = 0.8) +
  scale_fill_brewer(direction = -1, palette = "Spectral") + 
  scale_y_continuous(labels = scales::comma)+ 
  theme(legend.position = 'right') + 
  ggtitle("Wind KTS vs. Pressure by Category (H)")
```
  
   
What is the relationship between WIND_KTS and PRESSURE? This chart helps us to see that low PRESSURE and WIND_KTS are likely negatively correlated. We can also see that the WIND_KTS is essentially the predictor in the data set which can perfectly predict how a storm is classified. Well, it turns out, that's basically the distinguising feature when scientists are determining how to categorize these storms!  


#### Step ........

The rest is up to you! This is a great data set and there are a lot more pieces of information lurking within it. I want people to do their own EDA and send me anything interesting! 

Some food for thought:  

  * What was the most common name for a hurricane?
  * Do the names actually follow an alphabetical pattern through time? (This is one is tricky)
  * Can we merge this data with FEMA, charitable donations, or other aid data?

  
To get you started on the first one, here's the Top 10 most common names for tropical storms. Why do you think it's Florence?


```{r}
top_names = data %>%
  filter(NAME != 'NOTNAMED' & NAME != 'SUBTROP1') %>%
  group_by(NAME) %>%
  summarise(Years_Used = n_distinct(YEAR)) %>%
  arrange(-Years_Used)
p = ggplot(top_names %>% top_n(10), aes(x = reorder(NAME, Years_Used), y = Years_Used)) + theme_economist()
p + geom_bar(stat='identity') + coord_flip() + xlab('') + ggtitle('Most Used Tropical Storm Names')
```


Thank you for reading, I hope this helps you with your own data. The code is all written in R and is located on my [GitHub](https://github.com/stoltzmaniac/Data-Visualization-Lesson). You can also find other data visualization posts and usages of ggplot2 on my blog [Stoltzmaniac](https://www.stoltzmaniac.com?utm_campaign=bottom_of_tropical_storm_post)