Transcript quantification results are inconsistent #225

biochristmas · 2024-08-14T15:02:49Z

Hi, I ran the following command to quantify transcripts with IsoQuant across all samples:

isoquant.py --reference genome.fa --genedb ./merge.combined.gtf --fastq CK.derRNA.fastq T1.derRNA.fastq T2.derRNA.fastq --data_type nanopore --model_construction_strategy all --threads 60 -o isoquant_quant

The GTF annotation file contains 320,000 transcripts. IsoQuant generated two quantification files: OUT.transcript_model_grouped_tpm.tsv and OUT.transcript_grouped_tpm.tsv. OUT.transcript_model_grouped_tpm.tsv has 260,000 transcripts (rows), while OUT.transcript_grouped_tpm.tsv has 320,000.

First question: can I interpret OUT.transcript_grouped_tpm.tsv as containing the quantification results for all transcripts?

Second question: I noticed that the TPM values for the same transcript differ between OUT.transcript_model_grouped_tpm.tsv and OUT.transcript_grouped_tpm.tsv. Why is that?

Third question: if all TPM values for a transcript are zero in OUT.transcript_grouped_tpm.tsv, that transcript does not appear in OUT.transcript_model_grouped_tpm.tsv. I therefore believe that OUT.transcript_model_grouped_tpm.tsv contains quantification results only for expressed transcripts, while OUT.transcript_grouped_tpm.tsv contains results for all transcripts, including those whose TPM is zero in every sample. Is that correct?

andrewprzh · 2024-08-20T15:12:59Z

Dear @biochristmas

OUT.transcript_grouped_tpm.tsv contains all the reference transcripts, even those with 0 counts. There are no novel transcripts in this file.
OUT.transcript_model_grouped_tpm.tsv contains all the transcripts from the OUT.transcript_models.gtf (both novel and known). There should be no rows with 0 counts/TPMs since these transcripts must be supported by reads.

These files are produced by different algorithms, so some inconsistency is expected, although I agree it is quite noticeable in your example.
Which version are you using, by the way?

Best
Andrey
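For anyone who wants to check how the two tables relate in their own run, here is a minimal Python sketch. It assumes both files are tab-separated, with one header row, the transcript ID in the first column, and one TPM column per sample in the same order; that layout is an assumption and should be checked against the actual output. The function names are hypothetical helpers, not part of IsoQuant.

```python
import csv
import io


def read_tpm_table(handle):
    """Parse a tab-separated TPM table into {transcript_id: [tpm, ...]}."""
    reader = csv.reader(handle, delimiter="\t")
    next(reader)  # skip the header row (assumed: ID column, then one column per sample)
    table = {}
    for row in reader:
        if row and row[0]:
            table[row[0]] = [float(value) for value in row[1:]]
    return table


def compare_tpm_tables(reference, models, tolerance=1e-6):
    """Summarise how the reference-based and model-based tables differ."""
    shared = reference.keys() & models.keys()
    return {
        # Reference transcripts whose TPM is zero in every sample.
        "all_zero_in_reference": sorted(
            t for t, values in reference.items() if not any(values)
        ),
        # IDs present only in the model table (expected to be novel transcripts).
        "novel_only": sorted(models.keys() - reference.keys()),
        # Shared IDs whose TPM differs in at least one sample.
        "differing": sorted(
            t for t in shared
            if any(abs(a - b) > tolerance for a, b in zip(reference[t], models[t]))
        ),
    }
```

To use it on real output, open the two files and pass the handles, e.g. `read_tpm_table(open("isoquant_quant/OUT/OUT.transcript_grouped_tpm.tsv"))`; the path here is a guess at the output layout and may need adjusting.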

biochristmas · 2024-08-22T14:43:09Z

@andrewprzh, I am using IsoQuant version 3.4.1. I have a new question. I ran this command to extend the transcript annotation of the reference genome:

isoquant.py --reference ./reference.fa --genedb ./reference.gtf --fastq ./1_clustered.fasta ./2_clustered.fasta ./3_clustered.fasta --data_type pacbio_ccs --model_construction_strategy all --fl_data --threads 60 -o isoquant_result

Note: the FASTA files passed to --fastq are the result of clustering full-length transcript sequences with Isoseq3, so clustered.fasta contains far fewer sequences than flnc.fasta. With this command, the resulting OUT.extended_annotation.gtf contains 107,290 transcripts. However, when I pass flnc.fasta (the full-length transcript sequences before clustering) to --fastq instead, the resulting OUT.extended_annotation.gtf contains 185,626 transcripts.

So I would like to ask whether the number of transcripts obtained when extending a reference annotation with isoquant.py depends on the number of sequences in the input files.

andrewprzh · 2024-09-03T13:54:17Z

@biochristmas

Yes, the number of reported transcripts can be affected by the number of input sequences (what we also call "read support"), so this result is expected.

If you use --data_type pacbio_ccs with clustered PacBio transcripts, some transcripts (especially novel ones) may not be reported because they are supported by only a single sequence. You may try running IsoQuant with --data_type transcripts, which requires only a single supporting sequence for a transcript to be reported. For FLNC reads, --data_type pacbio_ccs should work fine.

Best
Andrey
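To compare transcript counts between runs (for example clustered input versus FLNC input), a small Python sketch like the one below can count the transcripts in each OUT.extended_annotation.gtf. It assumes standard GTF layout: nine tab-separated fields, the feature type in the third field, and a transcript_id "..." attribute in the ninth. The function name is a hypothetical helper. Novel transcript IDs are likely assigned per run, so only the totals are meaningfully comparable across runs.

```python
import re

# Matches the transcript_id attribute in the ninth GTF field.
TRANSCRIPT_ID = re.compile(r'transcript_id "([^"]+)"')


def transcript_ids(lines):
    """Collect unique transcript IDs from 'transcript' records of a GTF."""
    ids = set()
    for line in lines:
        if line.startswith("#"):
            continue  # skip comment and header lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != "transcript":
            continue  # ignore gene, exon, and malformed records
        match = TRANSCRIPT_ID.search(fields[8])
        if match:
            ids.add(match.group(1))
    return ids
```

Calling `len(transcript_ids(open(path)))` on the GTF from each run gives the two totals to compare.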

andrewprzh added the question Further information is requested label Aug 20, 2024
