Skip to content

Latest commit

 

History

History

tcr

The set of Python code described in Riaz et al. (2017) performs three basic operations on the tabulated repertoires – containing unique CDR3 clonotypes and their respective abundances – provided as output by the Immunoseq v2.0 (Adaptive Biotechnologies) proprietary processing pipeline (compressed as /data/tcr/Immunoseq.tgz):

  1. Filtration of unproductive CDR3 sequences
  2. Parsing of CDR3 abundance data to the resolutions of: V-J combinations (VJ), amino acid CDR3 motifs (aaCDR3), and VJ/aaCDR3-unique clonotypes (totCDR3)
  3. Calculation of entropy and other diversity statistics on each repertoire as described at each of these resolutions

The output of these functions are provided in a Microsoft Excel sheet (TCRmetrics_Riaz2017.xlsx) and its .csv version TCRmetrics.csv.

An additional R script ("createTCRfigures.R" is provided to automate the graphical output in the manuscript figures.

Python functions

The pipeline employs three primary Python programs:

JS_splitter_Imsqv4.5.py [filtertype] [filename] – parses Immunoseq (*.tsv) output containing 46 columns into a repertoire tabulation and resolutions (output files [samplename]_VJ.out, [samplename]_aaCDR3.out, [samplename]_totCDR3.out, [samplename]_ntCDR3.out).

JS_entropy_v7.7.py [filename1] [filename2] – performs entropy calculations and entropy comparisons on the “complete” repertoire files pairwise, to produce output file [samplename]_H_CL_JS.out. This code is functionally identical to that by the same name in https://github.com/ShenLab/Repertoire, as published by Sims et al. (2016).

JS_aaCDR3perVJ_v2.5.py [filename] – parses the amino acid CDR3 motifs encoded by each V-J combination, and outputs the number, entropy (H), and normalized entropy (Hnorm) of the aaCDR3 repertoire encoded by each V-J combination.

and one file handling script:

linepick_v7.7.2.py – which simply extracts single-sample statistics and paired-sample statistics from the output of JS_entropy_v7.7.py and assembles them into summary files for groupwise analysis. Current version is filename dependent and can be replaced by a more flexible input function, or shell scripting.

Additionally, we used:

multiple_joint_kde.py, available through the Seaborn visualization library for python (https://github.com/mwaskom/seaborn).

Implementation

Place the python scripts in the working directory containing tabulated repertoire files, such as those provided for export by Adaptive for the Immunoseq sequencing analysis. These versions are formatted to accommodate the columns associated with their v3.0 export(46 columns).

I. JS_splitter_Imsqv4.5.py [filtertype] [filename]

Input:

filtertype – The filter “only” removes any CDR3 sequences which: A) contain stop codons (“*”) in the amino acid translation B) do not contain an amino acid translation C) contain ambiguous nucleotides (“N”, “R”, “W”, “Y”) in the nucleotide sequence D) do not contain mapped V or J cassettes The filter “all” carries such CDR3s through into the parsed repertoires, including those reads among the listed clonotypes, and in statistical calculations on the repertoire. The “only” option was used to produce the repertoires analyzed in Riaz et al. (2017). filename – adaptive-exported repertoire file Further input instructions are in the comments on the input argument lines.

Output:

Program strips the input filename of its suffix to provide the "samplename" base for the output filename, producing [samplename]_only.productive.tsv and directories titled aaCDR3, totCDR3, and VJ, to which thusly parsed repertoires from all samples will be written as tab-delimited text files (*.out).

In [samplename]_only.productive.tsv, data columns are as follows:

  • Col0: VJcombo – V-J cassette combination derived from Col2 and Col3
  • Col1: Counts – Number of reads
  • Col2: Vcassette – Identity of mapped V cassette
  • Col3: Jcassette – Identity of mapped J cassette
  • Col4: aaCDR3 – amino acid translation of CDR3 motif
  • Col5: ntCDR3 - nucleotides comprising CDR3 motif

Example:

python JS_splitter_Imsqv4.5.py only Pt3_pre.tsv

.
.
.

Pt3_pre_only.productive.tsv
VJ/Pt3_pre_VJ.out
aaCDR3/Pt3_pre_aaCDR3.out
totCDR3/Pt3_pre_totCDR3.out

II. JS_entropy_v7.7.py [filename1] [filename2]

Input:

After running JS_splitter_Imsqv4.5.py, in the working directory, each input sample file is represented by [samplename]_only.productive.tsv. For each pair of samples, filename1 and filename2 are therefore now represented by filename1 = [samplename1]_only.productive.tsv and filename2 = [samplename1]_only.productive.tsv.

Output:

A new directory is created and named [samplename1]_[samplename2]. In this directory, paired repertoire files (frequencies of each TCR in each of the two samples) will be produced for each resolution, and a file of individual and comparative entropy-based statistics, including a sample size control for the Jensen-Shannon Divergence Metric (JSM) which samples repertoire 1 to the number of clonotypes as repertoire 2. For more details on this operation, see Sims et al. (2016). Note that Evenness = 1-Clonality = Hnorm = H/Hmax, where H is entropy, Hmax (maximum entropy) = log2(N), therefore Hnorm signifies normalized entropy.

The output file [samplename1]_[samplename2]_H_CL_JS.out contains:

Line0: headers for single-repertoire output

Line1 and Line2:

  • Col0 = samplename1 or samplename2
  • Col1 = Hcdr3 (aaCDR3 entropy)
  • Col2 = Hvj (VJ entropy)
  • Col3 = Htot (totCDR3 entropy)
  • Col4 = CLcdr3 (aaCDR3 clonality)
  • Col5 = CLvj (VJ clonality)
  • Col6 = CLtot (totCDR3 clonality)
  • Col7 = Hcdr3_max (maximum aaCDR3 entropy; log2(N) where N is aaCDR3 population size)
  • Col8 = Hvj_max (maximum VJ entropy; log2(N) where N is VJ population size)
  • Col9 = Htot_max (maximum totCDR3 entropy; log2(N) where N is totCDR3 population size)
  • Col10 = Num_CDR3 (number of unique aaCDR3)
  • Col11 = Num_VJ (number of unique VJ combinations)
  • Col12 = Num_totCDR3 (number of unique totCDR3 clonotypes)

Line3: headers for paired-repertoire output

Line4:

  • Col0 = “[samplename1]_[samplename2]”
  • Col1 = JScdr3 (Jensen-Shannon divergence of aaCDR3 repertoires)
  • Col2 = JSvj (Jensen-Shannon divergence of VJ repertoires)
  • Col3 = JStotCDR3 (Jensen-Shannon divergence of totCDR3 repertoires)
  • Col4 = cdr3_COMBINEDnum (total number of unique aaCDR3s in both repertoires)
  • Col5 = vj_COMBINEDnum (total number of unique VJ combinations in both repertoires)
  • Col6 = tot_COMBINEDnum (total number of unique totCDR3s in both repertoires)
  • Col7 = JScdr3_ctrl (aaCDR3 Jensen-Shannon divergence of samplename1 and samplename1’)*
  • Col8 = JSvj_ctrl (VJ Jensen-Shannon divergence of samplename1 and samplename1’)*
  • Col9 = JStot_ctrl (totCDR3 Jensen-Shannon divergence of samplename1 and samplename1’)*
  • Col10 = JScdr3_ctrlVAR (variance in aaCDR3 Jensen-Shannon divergence of samplename1 and samplename1’)*
  • Col11 = JSvj_ctrlVAR (variance in VJ Jensen-Shannon divergence of samplename1 and samplename1’)*
  • Col12 = JStot_ctrlVAR (variance in totCDR3 Jensen-Shannon divergence of samplename1 and samplename1’)*
* Sampling control (samplename1’) is created by downsampling the specified repertoire {aaCDR3, VJ, totCDR3} of samplename1 to the number of unique members in the equivalent repertoire of samplename2. Divergence calculations are then performed as JSD(samplename1,samplename1’). This process is performed 10 times, with the results in Line4,Col7-9 representing the mean value over these iterations, and Line4,Col10-12 representing the variance in those values.

Line5: headers for single-repertoire statistics of sampling control (samplename1’)

Line6:

  • Col0 = “CTRL_[samplename1]_[samplename2]”
  • Col1 = Hcdr3_ctrl (entropy of samplename1aaCDR3’)
  • Col2 = Hvj_ctrl (entropy of samplename1VJ’)
  • Col3 = Htot_ctrl (entropy of samplename1totCDR3’)
  • Col4 = CLcdr3_ctrl (clonality of samplename1aaCDR3’)
  • Col5 = CLvj_ctrl (clonality of samplename1VJ’)
  • Col6 = CLtot_ctrl (clonality of samplename1totCDR3’)
  • Col7 = Hcdr3_max_ctrl (maximum entropy of samplename1aaCDR3’)
  • Col8 = Hvj_max_ctrl (maximum entropy of samplename1VJ’)
  • Col9 = Htot_max_ctrl (maximum entropy of samplename1totCDR3’)
  • Col10 = Num_CDR3_ctrl (number of unique aaCDR3 in samplename1aaCDR3’)
  • Col11 = Num_VJ_ctrl (number of unique VJ in samplename1VJ’)
  • Col12 = Num_totCDR3_ctrl (number of unique totCDR3 in samplename1totCDR3’)

Example:

python JS_entropy_v7.7.py Pt3_pre.tsv Pt3_on.tsv

.
.
.

Pt3_pre_Pt3_on/Pt3_pre_Pt3_on_H_CL_JS.out
Pt3_pre_Pt3_on/Pt3_pre_Pt3_on_cdr3.out
Pt3_pre_Pt3_on/Pt3_pre_Pt3_on_VJ.out
Pt3_pre_Pt3_on/Pt3_pre_Pt3_on_tot.out

In addition to these statistical outputs, the minimum number (N) or fraction (D) of unique clonotypes constituting 10, 25, 50, or 90 percent of the repertoire (e.g. D25) were calculated post-hoc from the tabulated repertoire files, when sorted in Microsoft Excel.

III.linepick_7.7.2.py

Input:

Run from the top-level working directory, this program extracts the single-repertoire and paired-repertoire entropy statistics lines from all files *_H_CL_JS.out found inside directories of the same prefix as those files.

Output:

linepick_entropy.out
linepick_divergence.out

All lines are extracted from the *_H_CL_JS.out file described in Col0.

These files are suitable tables for all downstream and graphical analyses of the repertoires. Values from Col10 (Num_CDR3) and Col4 (CLcdr3) were used to generate Figure 5B.

IV. JS_aaCDR3perVJ_v2.5.py [filename]

Input:

After running JS_splitter_Imsqv4.5.py, in the working directory, each input sample file is represented by [samplename]_only.productive.tsv. These are the expected [filename] inputs.

Output:

Produces tabulated output file [samplename]_aaCDR3perVJ.tsv, with columns as follows:
  • Col0: VJcombo (name of V-J cassette combination)
  • Col1: Number_aaCDR3 (number of aaCDR3 encoded by VJcombo in Col0)
  • Col2: H_aaCDR3 (entropy of aaCDR3 encoded by VJcombo in Col0)
  • Col3: Hnorm_aaCDR3 (entropy of aaCDR3 encoded by VJcombo in Col0, divided by maximum entropy; log2(N) where N is aaCDR3 population size)

Example:

python JS_aaCDR3perVJ_v2.5.py Pt3_pre_only.productive.tsv

.
.
.

Pt3_pre_only_aaCDR3perVJ.tsv

The summary statistics from these output files were used to generate Figures 5C and 5D in Riaz et al. (2017). Figured 5C and 5D were generated using the median Hnorm of the aaCDR3 per VJ, as calculated post-hoc in Microsoft Excel from Col1 and Col3 of the *_aaCDR3perVJ.tsv output files as follows: For each patient, the VJcombos and associated data from "pre" and "on" were merged, replacing the VJcombo identities in Col0 with "Pre" or "On" per the originating repertoire for that line. This file was used as input for multiple_joint_kde.py, with "Pre" and "On" as the subsets from which to generate the kernel density estimate plots in Figure 5D. Note that the Seaborn data visualization library must be installed (https://seaborn.pydata.org/index.html).

Figure Creation

createTCRfigures.r

This script will generate Figure 5B and 5C using data generated from scripts above. See file for additional details.