inconsistency between transcript_tpm.tsv and transcript_model_tpm.tsv #248

Open
Mangosteen24 opened this issue Sep 30, 2024 · 2 comments

@Mangosteen24

Hi, thank you for developing this useful tool!

I would like to inquire about the differences between transcript_tpm.tsv and transcript_model_tpm.tsv.
As I understand it, transcript_model_tpm.tsv contains the expression of discovered transcript models in TPM (and corresponds to transcript_models.gtf), so it should include all expressed transcripts, both novel and known. Is that correct? However, when I search for a specific transcript, such as the canonical transcript ENST00000275493, I find it only in transcript_tpm.tsv, where it has a high TPM value; it is absent from transcript_model_tpm.tsv. I understand that the two files come from different algorithms (reference-based quantification and transcript discovery), but it is odd that ENST00000275493 is completely absent from transcript_model_tpm.tsv. Is there a reason for this inconsistency?
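For anyone who wants to reproduce this check, here is a minimal Python sketch of the lookup. It assumes that both files are plain tab-separated tables with the transcript ID in the first column and that header lines start with `#`; the function name `lookup_tpm` is made up for illustration and is not part of IsoQuant.

```python
import csv


def lookup_tpm(path, transcript_id):
    """Return the value columns for transcript_id, or None if it is absent.

    Assumption: tab-separated file, ID in the first column,
    header/comment lines starting with '#'.
    """
    with open(path, newline="") as handle:
        for row in csv.reader(handle, delimiter="\t"):
            if not row or row[0].startswith("#"):
                continue  # skip blank lines and header/comment lines
            if row[0] == transcript_id:
                return row[1:]  # values are returned as strings
    return None


if __name__ == "__main__":
    import os

    # File names as used in this thread; adjust the paths to your output folder.
    for name in ("transcript_tpm.tsv", "transcript_model_tpm.tsv"):
        if os.path.exists(name):
            print(name, lookup_tpm(name, "ENST00000275493"))
```

The match is exact, so if the annotation uses versioned IDs (for example with a `.N` suffix), the query string has to include the version.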

Similarly, ENST00000275493 is absent from transcript_models.gtf. So, would you recommend using transcript_models.gtf or extended_annotation.gtf for downstream analyses, such as SQANTI3?

Thank you!

@andrewprzh
Collaborator

Dear @Mangosteen24

Thank you for the feedback!

Your understanding is correct, and this is indeed a little odd. To understand where this inconsistency stems from, one has to dig deeper into the algorithms and the data.

A few questions. Which version do you use? Some of the inconsistencies were fixed at some point, but I cannot guarantee that all of them are eliminated.
What reads are assigned to these isoforms in .read_assignments.tsv.gz and what are their assignment types? Where do these reads go in .transcript_model_reads.tsv.gz?

Best
Andrey

@Mangosteen24
Author

Mangosteen24 commented Oct 2, 2024

Hi Andrey, I use the latest version of IsoQuant, v3.6.1.

First, I checked which reads were assigned to ENST00000275493 in OUT.read_assignments.tsv.gz and counted them by assignment_type:

1165 ambiguous
659 inconsistent
13394 inconsistent_ambiguous
20156 inconsistent_non_intronic
10032 unique
19 unique_minor_difference
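
A tally like the one above can be produced with a short Python sketch. This is only an illustration under stated assumptions: the read assignments file is taken to be gzipped and tab-separated, with the isoform ID in the fourth column and the assignment type in the sixth, and with header lines starting with `#`. The column indices are parameters so they can be adjusted, and `count_assignment_types` is a hypothetical helper, not an IsoQuant function.

```python
import gzip
from collections import Counter


def count_assignment_types(path, transcript_id, isoform_col=3, type_col=5):
    """Count reads per assignment type for one transcript.

    Assumptions: gzipped TSV; 0-based column 3 holds the isoform ID and
    column 5 the assignment type; header lines start with '#'.
    """
    counts = Counter()
    with gzip.open(path, "rt") as handle:
        for line in handle:
            if line.startswith("#"):
                continue  # skip header/comment lines
            fields = line.rstrip("\n").split("\t")
            if len(fields) <= max(isoform_col, type_col):
                continue  # skip malformed or truncated lines
            if fields[isoform_col] == transcript_id:
                counts[fields[type_col]] += 1
    return counts


if __name__ == "__main__":
    import os

    path = "OUT.read_assignments.tsv.gz"  # file name as used in this thread
    if os.path.exists(path):
        for kind, n in sorted(count_assignment_types(path, "ENST00000275493").items()):
            print(n, kind)
```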

Then I looked up the read IDs of those 10032 unique reads in OUT.transcript_model_reads.tsv.gz:

9049 *
 219 ENST00000344576
  28 ENST00000450046
   3 ENST00000485503
 236 transcript13328.chr7.nic
  56 transcript13334.chr7.nnic
  21 transcript13336.chr7.nic
 108 transcript13359.chr7.nic
 193 transcript13376.chr7.nic
 101 transcript13394.chr7.nic
   3 transcript13436.chr7.nic
   3 transcript13455.chr7.nic
   1 transcript13579.chr7.nnic
 804 transcript13776.chr7.nnic
   9 transcript13842.chr7.nnic
 621 transcript13858.chr7.nnic
 142 transcript13993.chr7.nnic
   3 transcript13995.chr7.nic
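
The lookup behind this table can be sketched in Python as two passes: collect the IDs of the unique reads for the transcript, then count which transcript model each of those reads lands on. The layout is an assumption: both files are taken to be gzipped TSVs, with the read ID in the first column of each, the isoform ID and assignment type in the fourth and sixth columns of the read assignments file, and the model ID in the second column of the transcript model reads file. Both function names are hypothetical.

```python
import gzip
from collections import Counter


def reads_with_type(path, transcript_id, wanted_type="unique",
                    read_col=0, isoform_col=3, type_col=5):
    """Collect IDs of reads assigned to transcript_id with the given type.

    Assumption: gzipped TSV with the column layout given by the defaults.
    """
    read_ids = set()
    with gzip.open(path, "rt") as handle:
        for line in handle:
            if line.startswith("#"):
                continue  # skip header/comment lines
            fields = line.rstrip("\n").split("\t")
            if len(fields) <= max(read_col, isoform_col, type_col):
                continue  # skip malformed or truncated lines
            if fields[isoform_col] == transcript_id and fields[type_col] == wanted_type:
                read_ids.add(fields[read_col])
    return read_ids


def tally_model_assignments(path, read_ids, read_col=0, model_col=1):
    """Count rows per transcript model for the selected reads.

    Assumption: gzipped TSV with read ID and model ID in the first two columns.
    A read listed on several rows is counted once per row.
    """
    counts = Counter()
    with gzip.open(path, "rt") as handle:
        for line in handle:
            if line.startswith("#"):
                continue
            fields = line.rstrip("\n").split("\t")
            if len(fields) <= max(read_col, model_col):
                continue
            if fields[read_col] in read_ids:
                counts[fields[model_col]] += 1
    return counts


if __name__ == "__main__":
    import os

    assignments = "OUT.read_assignments.tsv.gz"    # names as used in this thread
    model_reads = "OUT.transcript_model_reads.tsv.gz"
    if os.path.exists(assignments) and os.path.exists(model_reads):
        ids = reads_with_type(assignments, "ENST00000275493")
        for model, n in sorted(tally_model_assignments(model_reads, ids).items()):
            print(n, model)
```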

It seems that most of the unique reads were assigned to * rather than to ENST00000275493. Does * mean that a read was not assigned to any transcript model, whether known, NIC, or NNIC? I also noticed that most of these unique reads have the assignment event 'mono_exonic'; could that be why IsoQuant cannot determine which isoform they come from?
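
To put a number on the 'mono_exonic' observation, the assignment events of the unique reads can be tallied with a sketch like the following. The assumptions are labeled in the code: the events are taken to sit in the seventh column of the read assignments file as a comma-separated list, and anything after a `:` in an event is treated as a position suffix and dropped. `count_events` is a hypothetical helper.

```python
import gzip
from collections import Counter


def count_events(path, transcript_id, wanted_type="unique",
                 isoform_col=3, type_col=5, events_col=6):
    """Count assignment events among reads of one type for one transcript.

    Assumptions: gzipped TSV; events are a comma-separated list in 0-based
    column 6; text after ':' in an event is a suffix that can be dropped.
    """
    counts = Counter()
    with gzip.open(path, "rt") as handle:
        for line in handle:
            if line.startswith("#"):
                continue  # skip header/comment lines
            fields = line.rstrip("\n").split("\t")
            if len(fields) <= max(isoform_col, type_col, events_col):
                continue  # skip malformed or truncated lines
            if fields[isoform_col] != transcript_id or fields[type_col] != wanted_type:
                continue
            for event in fields[events_col].split(","):
                counts[event.split(":")[0]] += 1
    return counts


if __name__ == "__main__":
    import os

    path = "OUT.read_assignments.tsv.gz"  # file name as used in this thread
    if os.path.exists(path):
        for event, n in count_events(path, "ENST00000275493").most_common():
            print(n, event)
```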
