Replace home-brew string end searching with memchr. #1842

Merged Sep 27, 2024 (1 commit)

jkbonfield
Contributor

With long aux tags the trivial while loop can be surprisingly slow.

"while (s < end && *s) ++s;" isn't well vectorised or turned into word-by-word processing by either gcc or clang, but these tricks are used by the system memchr implementation.
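As a rough illustration of the substitution described above (not the actual htslib code; the function names `find_nul_loop` and `find_nul_memchr` are hypothetical), the byte loop and a memchr-based equivalent might look like this. The one detail to watch is that memchr returns NULL when the byte is absent, whereas the loop stops at `end`:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Byte-at-a-time scan: returns a pointer to the first NUL in [s, end),
 * or end if there is none. */
static const char *find_nul_loop(const char *s, const char *end) {
    assert(s <= end);
    while (s < end && *s)
        ++s;
    return s;
}

/* Same contract, delegating the scan to memchr. memchr returns NULL when
 * the byte is absent, so that case is mapped back to end. */
static const char *find_nul_memchr(const char *s, const char *end) {
    assert(s <= end);
    const char *p = memchr(s, '\0', (size_t)(end - s));
    return p ? p : end;
}
```

Both functions should return the same pointer for any valid range, so the memchr form can stand in for the loop without changing the caller's bounds handling.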

An alternative could be this (used in my WIP VCF parser), which is more optimised for relatively short strings. Included here just for potential future reference on systems with noddy memchr implementations.

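#include <stddef.h>
#include <stdint.h>

// Non-zero if any byte of the 64-bit word x is zero.
#define haszero(x) (((x)-0x0101010101010101ULL)&~(x)&0x8080808080808080ULL)
static inline char *memchr8(char *s, char sym, size_t len) {
    // Replicate sym into all 8 bytes; the cast avoids sign extension
    // when char is signed and sym >= 0x80.
    const uint64_t sym8 = (uint8_t)sym * 0x0101010101010101ULL;
    uint64_t *s8 = (uint64_t *)s;
    uint64_t *s8_end = (uint64_t *)(s+(len&~7));

    // Coarse scan: stop at the first 8-byte word containing sym
    while (s8 < s8_end && !haszero(*s8 ^ sym8))
        s8++;

    // Precise identification
    char *s_end = s + len;
    s = (char *)s8;
    while (s < s_end && *s != sym) {
        s++;
    }

    return s < s_end ? s : NULL;
}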
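
One caveat with the snippet above: casting `char *` to `uint64_t *` assumes the platform tolerates unaligned 64-bit loads and that the compiler does not exploit strict aliasing. A more conservative sketch of the same word-at-a-time idea, loading each word through memcpy, could look like the following. The names `memchr8_portable` and `HASZERO64` are made up for this example and are not part of htslib.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Non-zero iff at least one byte of the 64-bit value x is zero. */
#define HASZERO64(x) \
    (((x) - UINT64_C(0x0101010101010101)) & ~(x) & UINT64_C(0x8080808080808080))

static const char *memchr8_portable(const char *s, char sym, size_t len) {
    assert(s != NULL);
    /* Replicate sym into every byte; go via unsigned char so a negative
     * char value is not sign-extended. */
    const uint64_t sym8 =
        (uint64_t)(unsigned char)sym * UINT64_C(0x0101010101010101);
    size_t i = 0;

    /* Word pass: stop at the first 8-byte block that contains sym. */
    for (; i + 8 <= len; i += 8) {
        uint64_t w;
        memcpy(&w, s + i, sizeof w);  /* well-defined for any alignment */
        if (HASZERO64(w ^ sym8))
            break;
    }

    /* Byte pass: pin down the exact position, or scan the short tail. */
    for (; i < len; i++)
        if (s[i] == sym)
            return s + i;

    return NULL;
}
```

Compilers typically turn a fixed-size memcpy like this into a single load, so the portable form should not cost much, though that would need measuring on the target system.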

@jkbonfield
Contributor Author

Some benchmarks using samtools reset which heavily uses the aux tag iterator.

My input data is an Ultima Genomics BAM, which has some long tags in there.

perf record samtools reset -O bam,level=0 -o /dev/null ~/lustre/qual_train/data/Ultima/HG002.10M-20M.bam
perf report -n | egrep 'skip_aux|aux_next'

Summed skip_aux and bam_aux_next perf counters for clang16: before=1697, now=473.
With gcc13: before=1342, now=218.

That seemed a bit extreme, so I wondered whether optimisation was skewing the comparison, with inlining moving work between functions, but building with -fno-inline backs up the huge speed difference. This input data typically has one Z tag of around 300 bytes.

Tested on some Novaseq alignments with no long string tags, the figures were around 170 (old) and 130 (new): a much smaller gain, but still a benefit.

@vasudeva8
Contributor

We may want to apply the same change in samtools view, where the same method is copied.

@vasudeva8 vasudeva8 merged commit 2ff207b into samtools:develop Sep 27, 2024
9 checks passed