Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Transcript quantification results are inconsistent #225

Open
biochristmas opened this issue Aug 14, 2024 · 3 comments
Open

Transcript quantification results are inconsistent #225

biochristmas opened this issue Aug 14, 2024 · 3 comments
Labels
question Further information is requested

Comments

@biochristmas
Copy link

Hi, I ran this command to perform Isoquant quantification with the aim of obtaining transcript quantification results for all samples: isoquant.py --reference genome.fa --genedb ./merge.combined.gtf --fastq CK.derRNA.fastq T1.derRNA.fastq T2.derRNA.fastq --data_type nanopore --model_construction_strategy all --threads 60 -o isoquant_quant. The GTF annotation file contains 320,000 transcripts. Based on Isoquant quantification, two quantification result files were generated: OUT.transcript_model_grouped_tpm.tsv and OUT.transcript_grouped_tpm.tsv. The number of transcripts (rows) in OUT.transcript_model_grouped_tpm.tsv is 260,000, while in OUT.transcript_grouped_tpm.tsv it is 320,000. Can I interpret the OUT.transcript_grouped_tpm.tsv file as containing the total transcript quantification results? Additionally, I noticed that the TPM values for the same transcript in OUT.transcript_model_grouped_tpm.tsv and OUT.transcript_grouped_tpm.tsv are inconsistent. Third question: If all TPM values for a transcript are zero in the OUT.transcript_grouped_tpm.tsv file, then that transcript will not appear in the OUT.transcript_model_grouped_tpm.tsv file. Therefore, I believe that the OUT.transcript_model_grouped_tpm.tsv file contains quantification results for transcripts that are expressed, while the OUT.transcript_grouped_tpm.tsv file contains quantification results for all transcripts, regardless of whether all sample TPM values are zero.
figure

@andrewprzh
Copy link
Collaborator

Dear @biochristmas

OUT.transcript_grouped_tpm.tsv contain all the reference transcripts, even those with 0 counts. There are no novel transcripts in this file.
OUT.transcript_model_grouped_tpm.tsv contains all the transcripts from the OUT.transcript_models.gtf (both novel and known). There should be no rows with 0 counts/TPMs since these transcripts must be supported by reads.

These files are produced by different algorithms, so some inconsistency is expected. Although I'd agree it's quite noticeable n your example.
Which version are you using, by the way?

Best
Andrey

@andrewprzh andrewprzh added the question Further information is requested label Aug 20, 2024
@biochristmas
Copy link
Author

@andrewprzh , I am using version 3.4.1 of IsoQuant。I have a new question: I noticed that when I run this command to expand transcript annotations on the reference genome: isoquant.py --reference ./reference.fa --genedb ./reference.gtf --fastq ./1_clustered.fasta ./2_clustered.fasta ./3_clustered.fasta --data_type pacbio_ccs --model_construction_strategy all --fl_data --threads 60 -o isoquant_result. Note: The FASTA sequences provided in the --fastq parameter are the results of clustering full-length transcript sequences using Isoseq3, and the number of sequences in clustered.fasta is much fewer compared to flnc.fasta. When I execute this command, the resulting file OUT.extended_annotation.gtf contains a total of 107,290 transcripts. However, when I use flnc.fasta (the full-length transcript sequences before clustering) as the input for the --fastq parameter, the resulting file OUT.extended_annotation.gtf contains 185,626 transcripts.
So, I would like to ask if the number of transcripts obtained when expanding reference genome annotations using isoquant.py is influenced by the number of sequences in the input files."

@andrewprzh
Copy link
Collaborator

@biochristmas

Yes, the number of reported transcripts can be affected by the number of sequences (something that we also call "read support"). So such result is expected.

If you use --data_type pacbio_ccs with clustered PacBio transcripts, it may happen that some of the transcripts (especially novel) are not reported since they are supported only by a single sequence. You may try running IsoQuant with --data_type transcripts, which will require only a single supporting sequence for a transcript to be reported. For FLNC reads using --data_type pacbio_ccs should work fine.

Best
Andrey

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
question Further information is requested
Projects
None yet
Development

No branches or pull requests

2 participants