blast_practical.Rmd

---
title: "BLAST practical"
author: "Philip Ashton"
date: '`r format(Sys.time(), "%d %B, %Y")`'
---

<style>
body {
text-align: justify}
</style>

```{r settings, include = FALSE}
switch <- FALSE
```

# Learning objectives {-}
To become familiar with:  

1. The mechanics of running BLAST searches  

2. How to interpret BLAST results 

3. Ways in which homology can complicate the interpretation of BLAST results  

4. How BLAST, combined with the use of the extra information in graph representation of genome assemblies can aid in the localisation of AMR determinants to genomic elements

# Exercise 1 {-}

In this exercise you will be given 6 assembled *Salmonella* genomes. You need to identify which serotype they are based on the results of BLAST analysis against a custom database of *Salmonella* antigen genes, and cross checking the results with (a shortened version of) the Kauffman-White scheme.

1. In the terminal, navigate to the `/home/ubuntu/data/day-3/blast-practical` directory, you should see 6 *Salmonella* reference genomes (fasta format) labelled `Salm_1` to `Salm_6` and a `salmonella_antigens.fasta` file. In order to run BLAST, we need a query and a database. In this case, the database is going to be the `salmonella_antigens.fasta` file, and the queries are going to be the 6 *Salmonella* assemblies. For simplicity, we will process them one at a time.

2.  First we need to process the FASTA which will be our database so that it is compatible with BLAST. For this we use the `makeblastdb` program. -

```{bash, eval = FALSE}
makeblastdb -dbtype nucl -in salmonella_antigens.fasta
```
  
**Question 1:** How many sequences are there in the `salmonella_antigens.fasta` file? What are their names?

<textarea id="name" name="name" cols="100" rows="3" placeholder="You can input your answer here...">
</textarea>

<div class="toggle"><button>Hint</button>
Since we know that each fasta record has a header, and each header starts with a '>'. We can use `grep ‘>’ salmonella_antigens.fasta` to show all the sequence headers in the file. **CAREFUL** - make sure to include the quotation marks around the > in that command, or you will overwrite the contents of `salmonella_antigens.fasta`.
</div><br>

3.1. Now, before we BLAST our query against the database we just made, we should decide which output format to use. Information on the output formats is available via the `-help` option. Note that we use `blastn` here as we are comparing nucleotides. Run:

```{bash, eval = FALSE} 
blastn -help
```

**Question 2:** Look at the `*** Formatting options` section of the `blastn -help` output. We are going to load our data into Excel; which do you think is the most suitable `-outfmt` option to use?

<textarea id="name" name="name" cols="100" rows="3" placeholder="You can input your answer here..."></textarea>

<div class="toggle"><button>Hint</button>
If you aren't sure which is the best format, move onto section 3.2 and experiment with different -outfmt options.
</div><br>

3.2. Now we are going to run our first BLAST. Enter the below command, where XXX is the outfmt option you chose in section 3.1

<pre class="bash">
blastn -query Salm_1.fasta -db salmonella_antigens.fasta -outfmt XXX -out Salm_1_vs_salmonella_antigens.tsv
</pre>

3.3. Then, examine the output file using `head`.

<pre class="bash">
head Salm_1_vs_salmonella_antigens.tsv
</pre>

**Question 3:** How many columns does the output have?

<textarea id="name" name="name" cols="100" rows="3" placeholder="You can input your answer here..."></textarea>

**Question 4:** How many lines of the file has `head` shown you? How do you only show the top 4 lines of the file?

<textarea id="name" name="name" cols="100" rows="3" placeholder="You can input your answer here..."></textarea>

<div class="toggle"><button>Hint</button>
Run `head --help` for more info.
</div><br>

**Question 5:** What is separating the columns?

<textarea id="name" name="name" cols="100" rows="3" placeholder="You can input your answer here..."></textarea>

<div class="toggle"><button>Hint</button>
What does the T in tsv stand for?
</div><br>

**Question 6:** What happens if you run the command without the `-outfmt` option, or with different `-outfmt` options?

<textarea id="name" name="name" cols="100" rows="3" placeholder="You can input your answer here..."></textarea>

<div class="toggle"><button>Hint</button>
Run the command with different outfmt options and look at the output file using `head` or `cat`.
</div><br>

3.4. Run blastn for each genome against the `salmonella_antigens.fasta` database. Remember to change the name of the output file to contain the name of the input genome assembly. I.e. the results for Salm_1 should contain the name salm1.

3.5.  Make sure that each of your output files (e.g. `Salm_1_vs_salmonella_antigens.tsv`) is in the best format for viewing in Excel before proceeding.

**Question 7:** What kind of `blast` would we do if we were comparing proteins against `Salm_1.fasta`? What kind of BLAST would we do if we were comparing `salmonella_antigens.fasta` against `Salm_1.fasta`?

<textarea id="name" name="name" cols="100" rows="3" placeholder="You can input your answer here..."></textarea>

4. Now, we are going to look at the BLAST output using Excel. For this, the output files (e.g. `Salm_1_vs_salmonella_antigens.tsv`) need to be on your local computer rather than your CLIMB instance.

4.1. Use Filezilla to move the output files onto your local computer and open them one at a time using Excel.

<div class="alert alert-info">
You also need to download the `Salmonella serotypes.xlsx` file to your local computer.
</div>

4.2. You might notice that `blastn` does not include column names in the output, these are the column headings, paste them into row 1 in Excel.

```{r table2, echo=FALSE, message=FALSE, warnings=FALSE, results='asis'}
tabl <- "

| Columns         | 
|-----------------|
| query_id        | 
| subject_id      |
| pct_identity    |
| aln_length      | 
| n_of_mismatches |
| gap_openings    |
| q_start         |
| q_end           |
| s_start         | 
| s_end           |
| e_value         |
| bit_score       |"

cat(tabl) # output the table in a format good for HTML/PDF/docx conversion
```
<div class="toggle"><button>Hint</button>
If you are having trouble with changing the orientation of the headers, use Paste Special -> transpose in Excel.
</div><br>

**Question 8:** What do the different column headings mean?

<textarea id="name" name="name" cols="100" rows="3" placeholder="You can input your answer here..."></textarea>

<div class="toggle"><button>Hint</button>
Use Google to find out. For example, [here](http://www.metagenomics.wiki/tools/blast/blastn-output-format-6) or [here](https://www.ncbi.nlm.nih.gov/books/NBK279690/)
</div><br>

**Question 9:** How many BLAST hits are there for `Salm_1` vs `salmonella_antigens?

<textarea id="name" name="name" cols="100" rows="3" placeholder="You can input your answer here..."></textarea>

**Question 10:** Look at the help for `blastn`, can you figure out some additional command line parameters you could enter which would reduce the number of results you have to look at? What are the potential downsides of applying these additional parameters?

<textarea id="name" name="name" cols="100" rows="3" placeholder="You can input your answer here..."></textarea>

**Question 11:** Which is the best column to sort by so that you see the best hits first? Sort the data by this column.

<textarea id="name" name="name" cols="100" rows="3" placeholder="You can input your answer here..."></textarea>

**Question 12:** There are multiple different alleles of *fljB* and *fliC* in the BLAST database. Look at your BLAST results to identify the **best** *fljB* and *fliC* hits for each genome, and then look at the different serotypes in the Salmonella Serotypes excel file to identify which serotype has that combination of antigenic genes. Do this for all 6 genomes.

<div class="alert alert-info">
Reminder: *fliC* is H1 and *fljB* is H2.
</div>

<textarea id="name" name="name" cols="100" rows="3" placeholder="You can input your answer here..."></textarea>

<div class="toggle"><button>Hint</button>
One of the genomes might not have both H1 and H2.
</div><br>

6. Take the BLAST results for Salm_6. Using Excel, filter the BLAST results so that only hits where the subject was either the ‘best’ `fljB` or ‘best’ `fliC`. You should have `6` blast hits.

**Question 13:** Why are there 6 results?

<textarea id="name" name="name" cols="100" rows="3" placeholder="You can input your answer here..."></textarea>

<div class="toggle"><button>Hint</button>
Look at the query start and query end values for the `fljB` and `fliC` hits.
</div><br>

<div class="toggle"><button>Hint</button>
There is an area of sequence similarity between *fljB* and *fliC*.
</div><br>

**Question 14:** What is another way you could tell which serotype the genomes are? The above method, looking at the genes which determine the serotype could be described as the ‘biologist’ way of serotyping from the genome, what might the ‘computer scientist/bioinformatician’ way be?

<textarea id="name" name="name" cols="100" rows="3" placeholder="You can input your answer here..."></textarea>

<div class="toggle"><button>Hint</button>
If a biologist uses the genes underlying a 'traditional' microbiological property like antigenic structure to identify the serotype, what might someone who knows nothing about the genes underlying the phenotype do?
</div><br>

# Exercise 2 {-}

In exercise 2, you will use BLAST within the Bandage genome graph visualiser to determine whether antibiotic resistance determinants are carried on plasmids or chromosomes. 

1. You will start by carrying out some basic data exploration on the command line of your remote instance. Locate the files in `/home/ubuntu/data/day-3/blast-practical`. Below is the output of the following command `bioawk -c fastx '{ print $name, length($seq) }' < genes_and_plasmids.reduced.fasta`

![bioawk output](https://i.imgur.com/de7ynJY.png)

**Question 1:** What are the longest and shortest sequences in `genes_and_plasmids.fasta`?

<textarea id="name" name="name" cols="100" rows="3" placeholder="You can input your answer here..."></textarea>

<div class="toggle"><button>Hint</button>
Use google to tell you how to do this. [this answer](https://www.biostars.org/p/118954/#119009) works well 
</div><br>

**Question 2:** What are the differences between the sequences which have a name beginning with ‘p’ (i.e. lowercase p) and those which don’t? This is a question about your knowledge of microbiology naming conventions.

<textarea id="name" name="name" cols="100" rows="3" placeholder="You can input your answer here..."></textarea>

**Question 3:** Either using Google or your microbiology knowledge, which AMR genes in the file are typically plasmid-borne and which are typically chromosomal?

<textarea id="name" name="name" cols="100" rows="3" placeholder="You can input your answer here..."></textarea>

<div class="toggle"><button>Hint</button>
Use `grep ‘>’ genes_and_plasmids.fasta` to find the names of the sequences in the file.
</div><br>

2. Using filezilla, download the 3 FASTG files and genes_and_plasmids.fasta from `/home/ubuntu/data/day-3/blast-practical` to your local computer

3. Load the first sample into Bandage following the instructions from the [Bandage github page](https://github.com/rrwick/Bandage/wiki/Getting-started#load-a-graph)

**Question 4:** What are the characteristics of this assembly in terms of the length, the N50, etc?

<textarea id="name" name="name" cols="100" rows="3" placeholder="You can input your answer here..."></textarea>

**Question 5:** Do you think the assembly used an appropriate range of k-mer sizes? What other datasets do we need to get a better answer to this question?

<textarea id="name" name="name" cols="100" rows="3" placeholder="You can input your answer here..."></textarea>

<div class="toggle"><button>Hint</button>
Read [this](https://github.com/rrwick/Bandage/wiki/Effect-of-kmer-size)
</div><br>

**Question 6:** Can you see any sequences which look like they could be plasmids?

<textarea id="name" name="name" cols="100" rows="3" placeholder="You can input your answer here..."></textarea>

6. Prepare the FASTG file for BLAST analysis by following the instructions from [here](https://github.com/rrwick/Bandage/wiki/BLAST-searches)

<div class="alert alert-info">
You need to install BLAST so that it can be used by Bandage. For this, download BLAST from [here](ftp://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/LATEST/ncbi-blast-2.8.1+-win64.exe), and place it in the same directory as the Bandage executable. This will **probably** be in the download directory.
</div>

7. BLAST the `genes_and_plasmids.fasta` against the genome loaded into Bandage. You can reduce the view to include only blast hits by changing the 'scope' drop down in the 'graph drawing' section, select 'around blast hits', change the distance to e.g. 2 or 3, and click draw graph. The distance controls how many contigs will be included in the view. Setting distance to 3 means that all contigs which are within 2 or 3 'jumps' of a contig with a blast hit will be included. Experiment with this parameter to see what effect having it at e.g. 0 and 10 has.

8. Inspect the genome graph and the BLAST hits, explore the different options Bandage has for ‘scope’, zoom in and out, Node labels, Font etc.
    
**Question 7:** For each genome, which plasmids and AMR genes are present in the sample?

<textarea id="name" name="name" cols="100" rows="3" placeholder="You can input your answer here..."></textarea>

<div class="toggle"><button>Hint</button>
It might be easier to view the BLAST results via the ‘Create/view BLAST search results’
</div><br>

**Question 8:** If a plasmid is present, what does the ‘depth’ node label (combined with the graph structure) tell you about repeat sections in the plasmid?

**Question 9:** In the assembly SRR5451282, what is interesting about the CTX-M15 encoding gene?

<textarea id="name" name="name" cols="100" rows="3" placeholder="You can input your answer here..."></textarea>

<div class="toggle"><button>Hint</button>
blast every contig linked to the contig which encodes ctx-M15, you can do this by copying the node sequence or by using BLAST directly through Bandage via the Output tab. 
</div><br>

**Bonus Question 10:** why do you think OmpD is included in the genes_and_plasmids.fasta?

<textarea id="name" name="name" cols="100" rows="3" placeholder="You can input your answer here..."></textarea>

# Bonus Exercise {-}

If you have finsihed all the above exercises in good time, then feel free to have a go at this exercise, carrying out a Needleman-Wunsch alignment by hand. You can use paper or Excel, whichever you prefer.

1. Establish a scoring system - e.g. match = 1, mismatch = -1, gap = -1
2. You will be aligning two sequences `GAATTAA` and `GATTACA`. Create a scoring matrix like the below:   
![Scoring matrix](https://imgur.com/7JAQ6c9.png)
3. Fill in the table according to the instructions [here](https://en.wikipedia.org/wiki/Needleman%E2%80%93Wunsch_algorithm#Filling_in_the_table)
4. Trace back the arrows from the bottom right cell to the top left following the rules [here](https://en.wikipedia.org/wiki/Needleman%E2%80%93Wunsch_algorithm#Tracing_arrows_back_to_origin)
5. Check your answer by putting the sequences into [this](http://experiments.mostafa.io/public/needleman-wunsch/) website

<script>
  $(".toggle").click(function() {
    $(this).toggleClass("open");
  });
</script>