ska distance

Simon Harris edited this page Sep 4, 2018 · 28 revisions

SKA distance

The distance subcommand calculates pairwise distances between split kmer files and performs single-linkage clustering based on user-defined SNP and identity cutoffs.

Clustering cutoffs

The clustering method is simple single-linkage clustering. Two samples are placed in the same cluster if they meet both of the following requirements:

  1. The number of SNPs between them is less than the SNP cutoff [Default = 20], and
  2. They meet the identity cutoff [Default = 0.9], i.e. they share at least this proportion of the total number of split kmers in the file with fewer kmers.
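
The two rules above can be sketched in Python as follows. This is a minimal illustration, not SKA's actual implementation: the function name, the `pair_stats` layout and the sample values are all hypothetical, and the sketch assumes the SNP count and identity for each pair have already been computed.

```python
from itertools import combinations


def single_linkage_clusters(samples, pair_stats, snp_cutoff=20, identity_cutoff=0.9):
    """Group samples so that any pair passing both cutoffs shares a cluster.

    pair_stats maps frozenset({a, b}) -> (snps, identity). This layout is an
    assumption made for the sketch, not SKA's internal representation.
    """
    parent = {s: s for s in samples}

    def find(x):
        # Walk up to the root of x's set, shortening the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in combinations(samples, 2):
        stats = pair_stats.get(frozenset((a, b)))
        if stats is None:
            continue
        snps, identity = stats
        # Both conditions must hold: fewer SNPs than the SNP cutoff and
        # identity at or above the identity cutoff.
        if snps < snp_cutoff and identity >= identity_cutoff:
            parent[find(a)] = find(b)

    clusters = {}
    for s in samples:
        clusters.setdefault(find(s), []).append(s)
    return sorted(sorted(members) for members in clusters.values())


# Made-up values for illustration only.
stats = {
    frozenset(("A", "B")): (5, 0.99),
    frozenset(("B", "C")): (15, 0.95),
    frozenset(("A", "C")): (40, 0.95),  # too many SNPs, but linked through B
    frozenset(("C", "D")): (3, 0.50),   # few SNPs, but fails the identity cutoff
}
result = single_linkage_clusters(["A", "B", "C", "D"], stats)
print(result)
```

Because the linkage is single, A and C end up in the same cluster through B even though they do not pass the SNP cutoff directly, while D stays on its own because it fails the identity cutoff.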

Distance output columns

Column      Description
File 1      The name of the first split kmer file being compared
File 2      The name of the second split kmer file being compared
Matches     Number of split kmers found in both files where the middle base is an A, C, G or T and matches between files
Mismatches  Number of split kmers found in only one of the files
Identity    Proportion of split kmers found in both files
SNPs        Number of split kmers found in both files where the middle base is an A, C, G or T but differs between files
Ns          Number of split kmers found in both files where the middle base is an N in at least one of the files
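
As a rough illustration of how these columns might be used downstream, the Python sketch below filters a distances table for pairs that pass both clustering cutoffs. It assumes the file is tab-separated with a header row that uses exactly the column names listed above. That layout is an assumption and should be checked against the real output. The function name and the example rows are hypothetical.

```python
import csv
import io


def close_pairs(handle, snp_cutoff=20, identity_cutoff=0.9):
    """Yield (file1, file2) for rows that pass both clustering cutoffs.

    Assumes a tab-separated table whose header row matches the column
    names documented above (an assumption, not a verified format).
    """
    reader = csv.DictReader(handle, delimiter="\t")
    for row in reader:
        if int(row["SNPs"]) < snp_cutoff and float(row["Identity"]) >= identity_cutoff:
            yield row["File 1"], row["File 2"]


# Made-up rows standing in for a real distances file.
example = io.StringIO(
    "File 1\tFile 2\tMatches\tMismatches\tIdentity\tSNPs\tNs\n"
    "sampleA\tsampleB\t9900\t80\t0.99\t5\t15\n"
    "sampleA\tsampleC\t9000\t900\t0.90\t40\t60\n"
)
pairs = list(close_pairs(example))
print(pairs)
```

To read a real file, pass an open file handle in place of the `io.StringIO` object.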

Cluster output columns

Column               Description
File                 The name of the split kmer file
Cluster__autocolour  An index for the cluster containing the file

Note: The __autocolour suffix on the Cluster column allows automatic colouring when the file is opened in MicroReact.

Usage

ska distance [options] <split kmer files>

Options:
-c 		Do not print clusters files.
-d 		Do not print distances file.
-h		Print this help.
-f <file>	File of split kmer file names. These will be added to or 
		used as an alternative input to the list provided on the 
		command line.
-i <float>	Identity cutoff for defining clusters. Isolates will be 
		clustered if they share at least this proportion of the 
		split kmers in the file with fewer kmers and pass the SNP 
		cutoff. [Default = 0.9]
-o <file>	Prefix for output files. [Default = distances]
-s <int>	SNP cutoff for defining clusters. Isolates will be clustered 
		if they are separated by fewer than this number of SNPs and 
		pass the identity cutoff. [Default = 20]
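
For scripted runs, the options above can be assembled into a command line programmatically. The Python sketch below is one way to do that. The helper name and the `.skf` file names are hypothetical, and only the flags documented above are used. It builds and prints the command without running it.

```python
import shlex


def build_ska_distance_command(files, prefix="distances", snp_cutoff=20,
                               identity_cutoff=0.9, file_list=None,
                               no_clusters=False, no_distances=False):
    """Return an argument list for `ska distance` using the documented flags."""
    cmd = ["ska", "distance",
           "-o", prefix,
           "-s", str(snp_cutoff),
           "-i", str(identity_cutoff)]
    if file_list:
        cmd += ["-f", file_list]  # file of split kmer file names
    if no_clusters:
        cmd.append("-c")          # do not print clusters files
    if no_distances:
        cmd.append("-d")          # do not print distances file
    # Options come first, then the split kmer files, as in the usage line.
    cmd += list(files)
    return cmd


# Hypothetical input file names.
cmd = build_ska_distance_command(["a.skf", "b.skf"], prefix="run1", snp_cutoff=10)
print(shlex.join(cmd))
```

The returned list can be passed to `subprocess.run(cmd, check=True)` on a system where `ska` is installed.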