Remove/collapse duplicate transcripts from combined quantification GTF #34

Open
SamBryce-Smith opened this issue Oct 13, 2022 · 0 comments

When subsetting to individual exons, many of the resulting entries may be entirely identical at the sequence/region level if the exon is shared between different full-length transcripts. This is wasted output in the GTF (inflating its size) and also triggers a warning when generating the salmon index.

[2022-03-19 15:44:58.606] [puff::index::jointLog] [warning] Removed 10618 transcripts that were sequence duplicates of indexed transcripts.
[2022-03-19 15:44:58.606] [puff::index::jointLog] [warning] If you wish to retain duplicate transcripts, please use the `--keepDuplicates` flag

It would probably be good to check for sequence duplicates before outputting the GTF. In these cases we could assign a 'combined tx id' (e.g. the transcript IDs joined with a string separator).
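
A rough sketch of what that collapsing step could look like, assuming each exon-level entry is already available as a record with coordinates, strand and a unique transcript ID. The `ExonTx` type, the `collapse_duplicates` function and the `|` separator are all hypothetical and only for illustration, not existing code in this repo. It treats entries with identical chrom/start/end/strand as duplicates (identical regions on the same genome give identical sequences) and also returns a map from the original IDs to the combined ID.

```python
from collections import defaultdict
from typing import NamedTuple


class ExonTx(NamedTuple):
    """Hypothetical single-exon 'transcript' record (not the pipeline's own type)."""
    chrom: str
    start: int
    end: int
    strand: str
    transcript_id: str


def collapse_duplicates(records, sep="|"):
    """Collapse records with identical coordinates into one combined record.

    Assumes every input transcript_id is unique. Returns the collapsed
    records (in order of first appearance) and a dict mapping each
    original transcript ID to the combined ID it ended up under.
    """
    # Group transcript IDs by region; identical regions imply identical sequence.
    groups = defaultdict(list)
    for rec in records:
        key = (rec.chrom, rec.start, rec.end, rec.strand)
        groups[key].append(rec.transcript_id)

    collapsed = []
    id_map = {}
    for (chrom, start, end, strand), tx_ids in groups.items():
        # Sort so the combined ID does not depend on input order.
        unique_ids = sorted(set(tx_ids))
        combined_id = sep.join(unique_ids)
        collapsed.append(ExonTx(chrom, start, end, strand, combined_id))
        for tx_id in unique_ids:
            id_map[tx_id] = combined_id
    return collapsed, id_map


if __name__ == "__main__":
    example = [
        ExonTx("chr1", 100, 200, "+", "txA_exon2"),
        ExonTx("chr1", 100, 200, "+", "txB_exon3"),
        ExonTx("chr1", 300, 400, "-", "txC_exon1"),
    ]
    collapsed, id_map = collapse_duplicates(example)
    for rec in collapsed:
        print(rec)
```

Keeping the `id_map` around (e.g. written out as a small TSV next to the GTF) would make it possible to trace any original tx ID back to the combined entry that salmon actually quantifies.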

As salmon index removes duplicates, this shouldn't cause any downstream problems, save for tx IDs potentially disappearing between the quant GTF and the salmon quantification output.
