ska distance
Simon Harris edited this page Apr 24, 2018 · 28 revisions
The distance subcommand calculates pairwise distances between split kmer files and clusters the files based on user-defined SNP and identity cutoffs. The distance and cluster output files are only created if a file name is specified for them, and at least one of the two must be specified.
The clustering method employed is very simple. Two files are clustered if they meet both of the following cutoffs:
- The number of SNPs between them is less than the SNP cutoff [Default = 20], and
- They meet the identity cutoff [Default = 0.9], i.e. they share at least this proportion of the total number of split kmers in the file with fewer kmers.
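The two cutoffs above can be sketched in a few lines of Python. This is a minimal illustration, not SKA's own code. The function names, the per-pair `shared` count (split kmers present in both files) and the `.skf` file names are hypothetical. Joining passing pairs transitively (single linkage) is also an assumption, since the page only states the pairwise rule.

```python
from itertools import combinations


def passes_cutoffs(snps, shared, kmers_a, kmers_b,
                   snp_cutoff=20, identity_cutoff=0.9):
    """Return True if a pair of files meets both clustering cutoffs."""
    # Identity is measured against the file with fewer split kmers.
    smaller = min(kmers_a, kmers_b)
    if smaller == 0:
        return False
    identity = shared / smaller
    # SNPs must be strictly below the cutoff; identity must reach the cutoff.
    return snps < snp_cutoff and identity >= identity_cutoff


def cluster(files, pair_stats, snp_cutoff=20, identity_cutoff=0.9):
    """Group files by linking every pair that passes both cutoffs.

    files: dict of file name -> number of split kmers (hypothetical input).
    pair_stats: dict of (name_a, name_b) -> (snps, shared kmers).
    Transitive (single linkage) grouping is an assumption of this sketch.
    """
    parent = {f: f for f in files}

    def find(f):
        # Follow parent links to the representative of f's group.
        while parent[f] != f:
            parent[f] = parent[parent[f]]
            f = parent[f]
        return f

    for a, b in combinations(files, 2):
        snps, shared = pair_stats[(a, b)]
        if passes_cutoffs(snps, shared, files[a], files[b],
                          snp_cutoff, identity_cutoff):
            parent[find(a)] = find(b)

    groups = {}
    for f in files:
        groups.setdefault(find(f), []).append(f)
    return sorted(sorted(g) for g in groups.values())


files = {"a.skf": 1000, "b.skf": 950, "c.skf": 1000}
pair_stats = {
    ("a.skf", "b.skf"): (5, 940),    # few SNPs, identity 940/950: clustered
    ("a.skf", "c.skf"): (300, 990),  # too many SNPs
    ("b.skf", "c.skf"): (12, 800),   # identity 800/950 is below 0.9
}
print(cluster(files, pair_stats))  # [['a.skf', 'b.skf'], ['c.skf']]
```

Note that the SNP test is strict (fewer than the cutoff), while the identity test is inclusive (at least the cutoff), matching the wording of the two rules.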

The distances output file (-d) contains the following columns:

Column | Description |
---|---|
File 1 | The name of the first split kmer file being compared |
File 2 | The name of the second split kmer file being compared |
Matches | Number of split kmers found in both files where the middle base is an A, C, G or T and matches between files |
Mismatches | Number of split kmers found in only one of the files |
SNPs | Number of split kmers found in both files where the middle base is an A, C, G or T but differs between files |
Ns | Number of split kmers found in both files where the middle base is an N in at least one of the files |
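To make the four count columns concrete, the sketch below tallies them for two toy inputs. It is an illustration only. Representing each file as a dictionary from split kmer to middle base is a hypothetical simplification, as are the function name and the example kmers. It also assumes middle bases are limited to A, C, G, T and N.

```python
def compare_split_kmers(file1, file2):
    """Count Matches, Mismatches, SNPs and Ns for two split kmer dicts.

    Each argument maps a split kmer (the flanking sequence) to its middle
    base. This dict representation is an assumption of the sketch.
    """
    counts = {"Matches": 0, "Mismatches": 0, "SNPs": 0, "Ns": 0}
    for kmer in file1.keys() | file2.keys():
        if kmer not in file1 or kmer not in file2:
            # Split kmer found in only one of the files.
            counts["Mismatches"] += 1
            continue
        base1, base2 = file1[kmer], file2[kmer]
        if base1 == "N" or base2 == "N":
            # Middle base is an N in at least one file.
            counts["Ns"] += 1
        elif base1 == base2:
            counts["Matches"] += 1
        else:
            counts["SNPs"] += 1
    return counts


file1 = {"ACG_TTA": "A", "GGC_ATC": "C", "TTA_CCG": "N", "CAT_GGA": "G"}
file2 = {"ACG_TTA": "A", "GGC_ATC": "T", "TTA_CCG": "G", "AAA_CCC": "T"}
print(compare_split_kmers(file1, file2))
# {'Matches': 1, 'Mismatches': 2, 'SNPs': 1, 'Ns': 1}
```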

The clusters output file (-c) contains the following columns:

Column | Description |
---|---|
File | The name of the split kmer file |
Cluster | An index for the cluster containing the file |
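A clusters file can be turned into a mapping from cluster index to file names with a short script such as the one below. This is a sketch under assumptions: the function name is hypothetical, the presence of a header row is assumed (pass `has_header=False` if the file has none), and the example content is made up for illustration.

```python
import csv
import io
from collections import defaultdict


def read_clusters(handle, has_header=True):
    """Map each cluster index to the list of files assigned to it."""
    reader = csv.reader(handle, delimiter="\t")
    if has_header:
        # Assumption: the first row holds the column names.
        next(reader, None)
    clusters = defaultdict(list)
    for row in reader:
        if len(row) < 2:
            continue  # skip blank or malformed lines
        file_name, cluster_index = row[0], row[1]
        clusters[cluster_index].append(file_name)
    return dict(clusters)


# Made-up example content in the two-column layout described above.
example = "File\tCluster\na.skf\t1\nb.skf\t1\nc.skf\t2\n"
print(read_clusters(io.StringIO(example)))
# {'1': ['a.skf', 'b.skf'], '2': ['c.skf']}
```

For a real file, an open file object works in place of the `io.StringIO` wrapper.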

    ska distance [options] <split kmer files>

    Options:
    -c <file>    Clusters output file name (tsv format).
    -d <file>    Distances output file name (tsv format).
    -h           Print this help
    -f <file>    File of split kmer file names. These will be added to or
                 used as an alternative input to the list provided on the
                 command line.
    -i <float>   Identity cutoff for defining clusters. Isolates will be
                 clustered if they share at least this proportion of the
                 kmers of the isolate with fewer kmers and pass the SNP
                 cutoff.
    -s <int>     SNP cutoff for defining clusters. Isolates will be clustered
                 if they are separated by fewer than this number of SNPs and
                 pass the identity cutoff

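When calling the subcommand from a script, it can help to assemble the argument list in one place and enforce the rule that at least one output file is given. The helper below is a hypothetical sketch, not part of SKA. It uses only the options listed above, and the file names in the example are illustrative.

```python
def build_distance_command(skf_files=(), file_list=None, distances_out=None,
                           clusters_out=None, snp_cutoff=None,
                           identity_cutoff=None):
    """Return the argument list for a `ska distance` call (hypothetical helper)."""
    if distances_out is None and clusters_out is None:
        # The page states that at least one output file must be specified.
        raise ValueError("specify distances_out (-d) and/or clusters_out (-c)")
    if not skf_files and file_list is None:
        raise ValueError("provide split kmer files or a file of file names (-f)")

    cmd = ["ska", "distance"]
    if clusters_out is not None:
        cmd += ["-c", clusters_out]
    if distances_out is not None:
        cmd += ["-d", distances_out]
    if file_list is not None:
        cmd += ["-f", file_list]
    if identity_cutoff is not None:
        cmd += ["-i", str(identity_cutoff)]
    if snp_cutoff is not None:
        cmd += ["-s", str(snp_cutoff)]
    # Split kmer files go last, after all options.
    cmd += list(skf_files)
    return cmd


cmd = build_distance_command(["a.skf", "b.skf"], distances_out="dist.tsv",
                             snp_cutoff=10)
print(" ".join(cmd))  # ska distance -d dist.tsv -s 10 a.skf b.skf
```

The returned list can be passed to `subprocess.run(cmd, check=True)`, assuming the `ska` executable is on the `PATH`.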
SKA is currently only available as a preprint, so for now, if you use it, please cite: Harris SR. 2018. SKA: Split Kmer Analysis Toolkit for Bacterial Genomic Epidemiology. bioRxiv 453142 doi: https://doi.org/10.1101/453142