for_loops_and_apply_fam_master.Rmd

---
title: "For loops and the apply family"
author: "Stephanie Thiede"
date: "11/15/2019"
output: html_document
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```

### 1. R basics - vectors, matrices, lists, logicals 

#### A. Vectors - 1D arrays: 

##### Initialize a vector 
Two examples of how to initialize a vector 
```{r}
my_vec = c(1,2,3,4,5,6,7,8,9,10)
my_vec

my_vec = 1:10 
my_vec
```

##### Initialize an empty vector 
How to initialize an empty vector, to store things in later 
```{r}
my_vec = vector(mode = 'numeric', length = 10 )
my_vec
```

#### B. Matrices/data frames - 2D arrays: 

##### Initialize a matrix
Initialize a matrix called my_mat
```{r}
my_mat = matrix(1:10, ncol = 2, nrow = 5)
my_mat
class(my_mat)
```

Name the rows and columns using colnames() and row.names()
```{r}
colnames(my_mat) = c('A', 'B')
row.names(my_mat) = c('Sample1', 'Sample2', 'Sample3', 'Sample4', 'Sample5')

my_mat 
```

Get the dimensions of my_mat. Dimensions are reported as c(row, column) where numbers of rows in the first bin, and number of columns in the second bin
```{r}
dim(my_mat)
```

Can also ask for the number of rows and number of columns directly with ncol() and nrow()
```{r}
ncol(my_mat)
nrow(my_mat)
```

Can see the column names and row names of my_mat with rownames() and colnames()
```{r}
#get the column names of my_mat 
colnames(my_mat)

#get the row names of my_mat
rownames(my_mat)
```

##### Access rows/columns in matrices 
Format is: my_mat[row, column]

Can access with numeric indices 
```{r}
# row 2, column 1
my_mat[2,1]
```

Can access with row names and column names 
```{r}
# row 2, column 1
my_mat['Sample2','A']
```

Get all rows, 2nd column. Don't specify a row, just use a comma
```{r}
# all rows, 2nd column
my_mat[,2]
```

Get all columns, 1st row. Don't specify a column, just use a comma
```{r}
# all columns, 1st row
my_mat[1,]
```

##### C. Data frames 

Initialize a data frame 
```{r}
my_df = data.frame(A = c(1,2,3,4,5), B = c(6,7,8,9,10), row.names = c('Sample1', 'Sample2', 'Sample3', 'Sample4', 'Sample5'))

my_df

# get the class -- data.frame 
class(my_df)
```

Indexing dataframes is the same as matrices, but you can also index the columns using the $. 

3 ways to access column A, the first column 
```{r}
my_df[,'A']
my_df[,1]
my_df$A #can't do this with matrices! 
```

#### D. Lists
Each element of a list can contain vectors, matrices, lists, etc. of any shape or size. 

Here, we have a list where the first element will be a vector, the second is a matrix, and the third is a data.frame 
```{r}
my_list = list(my_vec, my_mat, my_df)

my_list
```

##### Accessing element of a list 
Use double square brackets to access a list element. Here, we're accessing the first element of my_list.
```{r}
my_list[[1]]
```


##### Acess content with a list element 

Access the second element of the list  
```{r}
my_list[[2]]
```

Then you can subset it with matrix subsetting (since its a matrix)
```{r}
my_list[[2]][1,1]
my_list[[2]]['Sample1','A']
```

##### Subsetting a list
Subset a list with single square brackets. This will make your list a length of 2 -- containing only the 1st and 3rd elements of my_list()

```{r}
my_list[c(1,3)]
```


#### E. Subsetting a data frame with logicals 

Say, your samples in my_df came from the following sources: either dog or cat as such. 
```{r}
sample_source = c('cat', 'dog', 'cat', 'cat', 'dog')
```

See how the length of sample_source is the same as the number of rows in my_df -- as each sample is either from "dog" or "cat"
```{r}
length(sample_source)
dim(my_df)
```

Create a logical vector -- ask "where does sample source equal cat". Will give TRUE when sample_source is cat and FALSE when sample_source is dog.

```{r}
sample_source == 'cat'
```

See that the class is a logical 
```{r}
class(sample_source == 'cat')
```

We can subset the column 'A' in my_df on cat samples using the logical vector. 
```{r}
my_df$A[sample_source == 'cat']
```

Similarily, we can subset my_df on the dog samples using another logical vector -- and use comma to indicate we want all the columns. 
```{r}
my_df[sample_source == 'dog',]
```


### 2.1 Iterating through vectors 

### 2.1.1 Iterating through vectors with for loops 
#### A. Structure of a for loop 

for (thing in sequence){

  do thing
  
}

#### B. Examples 

Initialize vector of colors 
```{r}
colors = c('red', 'blue', 'green')
```

Create function my_fav_col() which takes as input a string and outputs another string "My favorite color is ____"
```{r}
my_fav_col <- function(color){
  return(paste('My favorite color is', color))
}
```


Iterating through indexing 
```{r}
for (i in 1:length(colors)){ 

  print( my_fav_col(colors[i]) )
  
}
```

Iterating through a vector called colors 
```{r}
for (thing in colors){ 

  print( my_fav_col(thing) )
  
}
```

### 2.1.2 Iterating through vectors with sapply()

Pass each element of colors to our function, my_fav_col

```{r}
sapply(colors, my_fav_col)
```

### 2.2 Iterating through matrices 

Recall my_mat 
```{r}
my_mat
```

Our goal is to get the mean of each row. Let's do so with both a for loop (2.2.1) and an apply statement (2.2.2)

### 2.2.1. Iterating through MATRICES with for loops 

```{r}
for (i in 1:nrow(my_mat)){
  row = my_mat[i,]
  print(mean(row))
}
```

### 2.2.2. Iterating through MATRICES with apply()
```{r}
apply(my_mat, 1, mean)
```


### 3. Vectorized vs. sapply vs. for loop 

Say we have 100,000 circle radii (1-100,000) and we want to calculate the area of each circle. 
```{r}
# Vector of circle radii 
radii = 1:100000
```

Here's our function to calculate the area of a circle: 
```{r}
# Function for area of a circle 
area_of_circle <- function(r){
  return(pi * r^2)
}
```

Here's an example of running our function on one radius -- a radius of 10 
```{r}
# Example of using the function 
area_of_circle(r = 10)
```


Let's see how much time it takes to use 1) vectorized R code 2) sapply() statement versus 3) a for loop. We'll run it N times to compare the trends. 
```{r}
# initialize vectors to store run time 
repeats = 40
time_vec = vector('numeric', length = repeats)
time_sapply = vector('numeric', length = repeats)
time_for = vector('numeric', length = repeats)

```

#### A. Vectorized
```{r}
for (i in 1:repeats){

  start = Sys.time()
  areas_method1 = area_of_circle(radii)
  end = Sys.time()
  
  time_vec[i] = as.numeric(end - start)
}
```


#### B. sapply() 
```{r}
for (i in 1:repeats){
  
  start = Sys.time()
  areas_method2 = sapply(radii, area_of_circle) 
  end = Sys.time()
  
  time_sapply[i] = as.numeric(end - start)
}

```

#### C. for loop 
```{r}

for (j in 1:repeats){
  
  start = Sys.time()
  
  areas_method3 = vector(mode = 'numeric', length = length(radii))
  
  for (i in 1:length(radii)){
    areas_method3[i] = area_of_circle(radii[i])
    }
  
  end = Sys.time()

  time_for[j] = as.numeric(end - start)
}

```

#### D. Comparing speed for vectorized code, for loop, apply
```{r}
max_val = max(c(time_vec, time_sapply, time_for)) + 0.01

hist(time_vec, col = rgb(1,0,0,0.5), breaks = seq(0,max_val, 0.01), xlim = c(0,max_val), 
     main = 'Comparing speed for vectorized code, for loop, apply', xlab = 'Run time (seconds)')
hist(time_sapply, col = rgb(0,1,0,0.5), breaks = seq(0,max_val, 0.01),add = TRUE)
hist(time_for, col = rgb(0,0,1,0.5), breaks = seq(0,max_val, 0.01),add = TRUE)
legend('topright', fill = c('red', 'green', 'blue'), legend = c('vectorized', 'sapply', 'for loop'))

```


To summarize: 

- vectorized code is faster than sapply() and for loops 

- for loops and sapply() are similar in speed, depending on the length of the vector and the task

- for loops completes same task in multiple lines of code, vectorization & sapply in 1 line of code so you might want to use an apply family function for clarigy 

Let's explore with examples. 

### 4. Looping over a vector or list: sapply() or lapply()

Read in TV ratings data 
```{r}
tv_data = read.csv('data/IMDb_Economist_tv_ratings.csv', stringsAsFactors = FALSE)
```

Let's take a look at the TV ratings data 
```{r}
head(tv_data, 15)
```

Say we wanted to figure out what the max number of seasons was for each show. We could approach this with a for loop or an sapply. Let's do both and compare. 

For loop. 
```{r}
#initialize vector to store max season number in
max_season = rep(NA, length(unique(tv_data$title))) 

# loop through indices 1 to length of the unique TV show titles 
  # access the show name (show)
  # get all the seasons for the show (all_season_numbers)
  # get the max season number and store in max_season[index]

#name storage vector to be the show title 
for (i in 1:length(unique(tv_data$title))){
  show = unique(tv_data$title)[i]
  all_season_numbers = tv_data$seasonNumber[tv_data$title == show]
  max_season[i] = max(all_season_numbers)
}
names(max_season) = unique(tv_data$title)

```

```{r}
head(max_season)
```

sapply version 
```{r}
max_season = sapply(unique(tv_data$title), function(show){
   all_season_numbers = tv_data$seasonNumber[tv_data$title == show]
   max(all_season_numbers)
})

```

#### sapply() will output your data in the simplest form it can. Here, we see if outputted a vector because we were trying to access 1 value (the max season number) from each tv show title

```{r}
head(max_season)
```


```{r}
hist(max_season, breaks = seq(0,45,1), main = 'Number of seasons per show', xlab = 'Number of seasons')
```

#### Let's explore some other ways sapply() can output your data. Here, let's find the average rating of each season of a show 
```{r}
avg_rating_per_show = sapply(unique(tv_data$title), function(show){
  tv_data$av_rating[tv_data$title == show]
})
head(avg_rating_per_show)
```

#### You can see sapply() outputted a list because each tv show had a variable number of seasons and thus a variable number of ratings. Thus, the simplest data structure was a list. 


#### Let's look at anther example -- Here, we will first subset on those with 2 seasons (or rather, 2 entries in the dataset):

Use an sapply() to figure out how many seasons/entries each show has 
```{r}
num_seasons = sapply(unique(tv_data$title), function(show){
  length(tv_data$seasonNumber[tv_data$title == show])
})
```

Subset on the show titles with 2 seasons/entries 
```{r}
two_seasons = unique(tv_data$title)[num_seasons == 2]
```

Find the average rating for the shows with two seasons 
```{r}
avg_rating_two_seasons = sapply(two_seasons, function(show){
  (tv_data$av_rating[tv_data$title == show])
})

head(t(avg_rating_two_seasons))
```
#### Because every show had 2 data points, a matrix was the simplest output 

#### We can control output with lapply() -- lapply() will ALWAYS output a list. 
```{r}
avg_rating_two_seasons = lapply(two_seasons, function(show){
  (tv_data$av_rating[tv_data$title == show])
})

head(avg_rating_two_seasons)
```


### We can use tapply() to summarize data 
Here, we want to know what is the average rating per show across all seasaons 
```{r}
avg_rating_per_title = tapply(tv_data$av_rating, tv_data$title, mean)
head(avg_rating_per_title)
```

### 5. Looping over a matrix/data.frame: apply()
Now let's play with apply statements with another dataset -- bacterial growth curve data. 

#### Growth curve example 

```{r}
growth = read.csv('data/mock_growth_curve_data.csv', row.names = 1)
```

Let's take a look at how the data is organized. We have 8 wells growing bacteria, where each well is a column. Rows indicate time, and time was taken every 15 minutes. 
```{r}
head(growth)
```


Say we want to take the mean at each timepoint. We can use apply, loop over the rows (1), and use the function mean()
```{r}
mean_growth = apply(growth, 1, mean)
```

```{r}
plot(row.names(growth), mean_growth, type = 'l', ylab = 'OD', xlab = 'Time (min)', main = 'Average growth')
```


We can also plot our curves with apply. Here, we can plot each well on its own plot.  
```{r}
apply(growth, 2, function(col){
  plot(row.names(growth), col, type = 'l', ylab = 'OD', xlab = 'Time (min)')
})
```

We can also plot all curves on the same plot 
```{r}
plot(row.names(growth), growth$Well1, type = 'l', ylab = 'OD', xlab = 'Time (min)')
apply(growth, 2, function(col){
  lines(row.names(growth), col)
})
```

It looks like there are two growth patterns. Let's explore by plotting the max difference at each time point. 
```{r}
max_difference = apply(growth, 1, function(row){
  max(row) - min(row)
})
```


```{r}
plot(row.names(growth), max_difference, type = 'l', ylab = 'max OD difference', xlab = 'Time (min)')
```


### 6. Faster looping with future.apply 
Here's a [Data Camp tutorial](https://campus.datacamp.com/courses/parallel-programming-in-r/foreach-futureapply-and-load-balancing?ex=9) to learn more about paralell programming in R -- specifically future and future.apply()