inconsistency between transcript_tpm.tsv and transcript_model_tpm.tsv #248

Mangosteen24 · 2024-09-30T02:58:25Z

Hi, thank you for developing this useful tool!

I would like to inquire about the differences between transcript_tpm.tsv and transcript_model_tpm.tsv.
The file transcript_model_tpm.tsv contains the expression of discovered transcript models in TPM (and corresponds to transcript_models.gtf). It should include all expressed transcripts, both novel and known, correct? However, when I search for a specific transcript, such as the canonical transcript ENST00000275493, I find it only in transcript_tpm.tsv, with a high TPM value; it is absent from transcript_model_tpm.tsv. I understand that these come from two different algorithms (reference-based and discovery), but it is a bit odd that ENST00000275493 is completely absent from transcript_model_tpm.tsv. Is there a reason for this inconsistency?

Similarly, ENST00000275493 is absent from transcript_models.gtf. So, would you recommend using transcript_models.gtf or extended_annotation.gtf for downstream analyses, such as SQANTI3?

Thank you!

andrewprzh · 2024-10-01T20:51:17Z

Dear @Mangosteen24

Thank you for the feedback!

Your understanding is correct, and this is indeed a little bit odd. To understand where this inconsistency stems from, one has to go deeper into the algorithms and data.

A few questions. Which version do you use? Some of the inconsistencies were fixed at some point, but I cannot guarantee all of them are eliminated.
What reads are assigned to these isoforms in .read_assignments.tsv.gz and what are their assignment types? Where do these reads go in .transcript_model_reads.tsv.gz?

Best
Andrey

Mangosteen24 · 2024-10-02T02:54:12Z

Hi Andrey, I use the latest version of IsoQuant, v3.6.1.

First, I checked which reads were assigned to ENST00000275493, and their assignment_type, in OUT.read_assignments.tsv.gz:

1165 ambiguous
659 inconsistent
13394 inconsistent_ambiguous
20156 inconsistent_non_intronic
10032 unique
19 unique_minor_difference
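
A tally like the one above can be reproduced with a short script. This is a minimal sketch, not part of IsoQuant: the function names are hypothetical, and it assumes the read assignment file is tab-separated with the isoform ID in the fourth column and the assignment type in the sixth, with header lines starting with `#`. Check the header of your own file and adjust the column indices if they differ.

```python
import gzip
from collections import Counter


def count_assignment_types(lines, isoform_id, isoform_col=3, type_col=5):
    """Count assignment types for reads assigned to one isoform.

    Column positions are assumptions about the file layout (0-based).
    """
    counts = Counter()
    for line in lines:
        # Skip header/comment lines and blank lines.
        if line.startswith("#") or not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) <= max(isoform_col, type_col):
            continue
        assigned = fields[isoform_col]
        # Accept the bare ID or the ID followed by a version suffix.
        if assigned == isoform_id or assigned.startswith(isoform_id + "."):
            counts[fields[type_col]] += 1
    return counts


def count_assignment_types_in_file(path, isoform_id):
    """Same tally, reading directly from a gzip-compressed TSV."""
    with gzip.open(path, "rt") as handle:
        return count_assignment_types(handle, isoform_id)
```

For example, `count_assignment_types_in_file("OUT.read_assignments.tsv.gz", "ENST00000275493")` returns a `Counter` keyed by assignment type.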

Then I looked up those 10032 unique read IDs in OUT.transcript_model_reads.tsv.gz:

9049 *
 219 ENST00000344576
  28 ENST00000450046
   3 ENST00000485503
 236 transcript13328.chr7.nic
  56 transcript13334.chr7.nnic
  21 transcript13336.chr7.nic
 108 transcript13359.chr7.nic
 193 transcript13376.chr7.nic
 101 transcript13394.chr7.nic
   3 transcript13436.chr7.nic
   3 transcript13455.chr7.nic
   1 transcript13579.chr7.nnic
 804 transcript13776.chr7.nnic
   9 transcript13842.chr7.nnic
 621 transcript13858.chr7.nnic
 142 transcript13993.chr7.nnic
   3 transcript13995.chr7.nic
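
The lookup above can be sketched in the same way. The code below is an illustration under stated assumptions, not IsoQuant's own logic: it assumes the transcript model reads file is tab-separated with the read ID in the first column and the transcript model ID (or `*`) in the second, and `tally_models` is a hypothetical helper name. It counts lines rather than distinct reads, so a read listed under several models contributes to each of them. The counts listed above add up to 11600, more than the 10032 reads, which suggests that this happens in this dataset.

```python
import gzip
from collections import Counter


def tally_models(model_read_lines, read_ids):
    """Count how often each transcript model is listed for the given reads.

    Assumes two tab-separated columns: read ID, then model ID or '*'.
    """
    counts = Counter()
    for line in model_read_lines:
        # Skip header/comment lines and blank lines.
        if line.startswith("#") or not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 2:
            continue
        read_id, model_id = fields[0], fields[1]
        if read_id in read_ids:
            counts[model_id] += 1
    return counts


def tally_models_in_file(path, read_ids):
    """Same tally, reading directly from a gzip-compressed TSV."""
    with gzip.open(path, "rt") as handle:
        return tally_models(handle, set(read_ids))
```

Passing the read IDs as a `set` keeps each membership check constant-time, which matters when the file holds millions of lines.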

It seems that most unique reads were assigned to * instead of ENST00000275493. Does that mean they were not assigned to any transcript model, whether known, NIC, or NNIC? I also noticed that most of the unique reads have the assignment event 'mono_exonic'; maybe that is why it cannot be determined which isoform they come from?
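
The share of mono-exonic reads among the uniquely assigned ones can be checked with a few lines. This is a rough sketch under assumptions: the function name is hypothetical, the column positions (isoform ID fourth, assignment type sixth, assignment events seventh) are assumed rather than taken from the IsoQuant documentation, and a read is treated as mono-exonic if the text `mono_exonic` appears anywhere in its events field.

```python
import gzip


def mono_exonic_share(lines, isoform_id, assignment_type="unique",
                      isoform_col=3, type_col=5, events_col=6):
    """Return (mono_exonic_reads, total_reads) for one isoform and type.

    Column positions are assumptions about the file layout (0-based).
    """
    mono = 0
    total = 0
    for line in lines:
        # Skip header/comment lines and blank lines.
        if line.startswith("#") or not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) <= max(isoform_col, type_col, events_col):
            continue
        if fields[isoform_col] != isoform_id:
            continue
        if fields[type_col] != assignment_type:
            continue
        total += 1
        # Substring match, since the exact event syntax is not assumed here.
        if "mono_exonic" in fields[events_col]:
            mono += 1
    return mono, total


def mono_exonic_share_in_file(path, isoform_id):
    """Same check, reading directly from a gzip-compressed TSV."""
    with gzip.open(path, "rt") as handle:
        return mono_exonic_share(handle, isoform_id)
```

The isoform ID is matched exactly here, so pass it in the same form as it appears in the file, including any version suffix.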
