Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Feature: SAM tag enum #102

Open
msto opened this issue Mar 12, 2024 · 8 comments
Open

Feature: SAM tag enum #102

msto opened this issue Mar 12, 2024 · 8 comments
Assignees

Comments

@msto
Copy link
Contributor

msto commented Mar 12, 2024

I think it would be helpful to provide an enum that describes the standard SAM tags.

I am happy to implement this but would appreciate feedback on design considerations before starting.

I have two primary questions.

  1. Should the member names be the actual SAM tags, or more semantically meaningful?

    e.g.

    class SamTag(str, Enum):
        """Standard SAM tags."""
        
        RG: "RG"
        """Read group."""
    
        RX: "RX"
        """Sequence bases of the (possibly corrected) unique molecular identifier."""

    or

    class SamTag(str, Enum):
        """Standard SAM tags."""
        
        READ_GROUP: "RG"
        """Read group."""
    
        UMI: "RX"
        """Sequence bases of the (possibly corrected) unique molecular identifier."""

    (note that I suggest mixing in str so the enums can be passed directly to pysam's tagging functions, e.g. read.has_tag(SamTag.UMI))

  2. To permit extensions to custom tags, I think providing a class decorator would be more helpful than subclassing. The decorator would

    1. inject the standard tags
    2. enforce uniqueness using enum.unique
    3. enforce that custom tags are two-character strings, and
    4. optionally enforce that custom tags adhere to SAM convention, namely that tags start with "X", "Y", or "Z", or are lowercase

    (i) and (ii) would be achievable with subclassing, but I think a decorator is necessary to provide the validations on custom tags. Thoughts?

    e.g.

    @sam_tag(strict=False)
    class ClientTag(str, Enum):
        """Custom SAM tags used for $client."""
        
        FOO: "XF"
        """Foo."""
        
        BAR: "XB"
        """Bar."""
@msto msto self-assigned this Mar 12, 2024
@nh13
Copy link
Member

nh13 commented Mar 15, 2024

Does this belong in pysam, or in htslib and exposed in pysam? Have we asked them if they would accept a contribution? And if not, why?

@msto
Copy link
Contributor Author

msto commented Mar 15, 2024

I can open a similar issue on pysam.

I don't think I have a clear mental model on distinguishing the features we consider for inclusion in fgpyo vs attempting to submit upstream, so I started here

@msto
Copy link
Contributor Author

msto commented Mar 15, 2024

@msto
Copy link
Contributor Author

msto commented Apr 17, 2024

@nh13 Given the lack of response from pysam, would this be re-considered for inclusion in fgpyo?

I have an initial implementation at https://github.com/msto/sam_tags/ that I would love to merge in here

@msto
Copy link
Contributor Author

msto commented May 27, 2024

@nh13 bump 🙂

@msto
Copy link
Contributor Author

msto commented Jul 12, 2024

@nh13 bump

@nh13
Copy link
Member

nh13 commented Jul 13, 2024

I am ok with this. @tfenne ?

@tfenne
Copy link
Member

tfenne commented Jul 13, 2024

I'm game for the first half - i.e. an enum that encodes standard values. I generally prefer to use the tags for the name (so e.g. NM, SA) rather than a longer name because the former is explicit and well-known and I think this is mostly to prevent typos?

I'm not sure I see the value (yet) in the decorator for making other enums of sam tags. I.e. I tend to think we should use str anywhere we require a tag, and then you can supply one of these enum values, a str directly or another similar enum.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants