sheepTSexample

Course Schedule (liable to change)

Setup

(Optional) To run the simulation script, you need to set up scrm
For the simulation script and data analysis, you need an environment with tsinfer and numpy, e.g. conda create -n tsinfer -conda-forge tsinfer numpy ipython

Data

There is a simulated data set of 105 sheep (i.e. 210 haploid genomes) along 1Mbp of sequence:

5 mouflon
50 Iranian
50 Border

Demography (N - effective population size, sampled - number of diplopids sampled) following the work of Pedro Morell Miranda et al. (2023; Demographic reconstruction of the Western sheep expansion from whole-genome sequences, https://academic.oup.com/g3journal/article/13/11/jkad199/7262565)

VCF vs Tree Sequence (TS)

Both files have the same variant data, but the TS is only half the size. This ratio becomes even more beneficial for larger data sets. The TS file also contains richer information (trees).

-rw------- 1 hbecher datastore_eb_groups_HighlanderLab_admin 2.5M Feb 20 11:58 test1M.out.inf.ts
-rw------- 1 hbecher datastore_eb_groups_HighlanderLab_admin 5.6M Feb 20 11:58 test1M.out.inf.vcf

Analysing data in TS format

Open ipython, Jupyter notebook or similar

import tskit
# load the TS file
ts = tskit.load("test1M.out.inf.ts")

# how amny sample are there?
ts.num_samples

# samples 0 to 9 are mouflon genomes, 10 to 109 are Iranian, 110 to 209 are Border, mind the python indexing below
# compute nucleatide diversity in mouflon, Iranian, and Border
ts.diversity([range(10), range(10,110), range(110, 210)])

# Fst mouflon-Iranian
ts.Fst([range(10), range(10,110)])
> 0.0474287776006701

# Fst mouflon-Border
ts.Fst([range(10), range(110,210)])
> 0.3863240194680754

# Fst Iranian-Border
ts.Fst([range(10,110), range(110,210)])
> 0.344373582032233

# Based on gene trees and looking at the mouflon samples (first argument) who are the closest relatives (second argument = list of ranges)
ts.genealogical_nearest_neighbours(range(10), [range(10),range(10, 110), range(110, 210)])
> array([[0.59771033, 0.36771007, 0.0345796 ],
>        [0.48813082, 0.48295405, 0.02891513],
>        [0.51328397, 0.44918785, 0.03752818],
>        [0.74977978, 0.23451288, 0.01570734],
>        [0.63255684, 0.35004292, 0.01740025],
>        [0.48170299, 0.49188582, 0.02641118],
>        [0.51438889, 0.43455347, 0.05105764],
>        [0.65187813, 0.32764667, 0.0204752 ],
>        [0.46132971, 0.48903879, 0.0496315 ],
>        [0.60860476, 0.35239091, 0.03900433]])
# The columns are mouflon, Iranian, Border
# The rows correspond to the mouflon genomes
# Mouflon genomes are most similar to mouflon, also simlar to Iranian, but very different to Border
# This is useful when there are unlabelled samples

See here for other stats and inspiration.

Name		Name	Last commit message	Last commit date
Latest commit History 18 Commits
Notebooks		Notebooks
.gitignore		.gitignore
01runScrm1M.sh		01runScrm1M.sh
02inferTs.py		02inferTs.py
LICENSE		LICENSE
README.md		README.md
recipe_17.5.trees		recipe_17.5.trees
test1M.out		test1M.out
test1M.out.inf.ts		test1M.out.inf.ts
test1M.out.inf.vcf		test1M.out.inf.vcf
tiny_ts.ts		tiny_ts.ts
ts1		ts1
ts2		ts2

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

sheepTSexample

Course Schedule (liable to change)

Setup

Data

VCF vs Tree Sequence (TS)

Analysing data in TS format

About

Releases

Packages

Contributors 3

Languages

License

HighlanderLab/sheep_ts_demo

Folders and files

Latest commit

History

Repository files navigation

sheepTSexample

Course Schedule (liable to change)

Setup

Data

VCF vs Tree Sequence (TS)

Analysing data in TS format

About

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Contributors 3

Languages

Packages