Replace home-brew string end searching with memchr. #1842

jkbonfield · 2024-09-24T16:19:54Z

With long aux tags the trivial while loop can be surprisingly slow.

"while (s < end && *s) ++s;" isn't well vectorised or turned into word-by-word processing by either gcc or clang, but these tricks are used by the system memchr implementation.
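As a rough illustration of the swap described above (not the actual htslib diff), the two approaches can be written side by side. The function names here are hypothetical and chosen only for this sketch; both return a pointer to the first NUL in [s, end), or end if there is none.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: the byte-at-a-time scan quoted above.
 * Returns a pointer to the first NUL in [s, end), or end if none. */
static const unsigned char *find_nul_loop(const unsigned char *s,
                                          const unsigned char *end) {
    while (s < end && *s) ++s;
    return s;
}

/* Hypothetical helper with the same contract, but delegating the search
 * to the C library's memchr, which is typically vectorised or processes
 * a word at a time. */
static const unsigned char *find_nul_memchr(const unsigned char *s,
                                            const unsigned char *end) {
    const unsigned char *p = memchr(s, 0, (size_t)(end - s));
    return p ? p : end;
}
```

Mapping memchr's NULL result back to end keeps the "not found" behaviour identical to the loop, so a caller should not need to change.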

An alternative could be this (used in my WIP VCF parser), which is more optimised for relatively short strings. Included here just for potential future reference on systems with noddy memchr implementations.

#define haszero(x) (((x)-0x0101010101010101ULL)&~(x)&0x8080808080808080ULL)
static inline char *memchr8(char *s, char sym, size_t len) {
    // Cast via unsigned char so a sym >= 0x80 is not sign-extended
    const uint64_t sym8 = (uint64_t)(unsigned char)sym * 0x0101010101010101ULL;
    uint64_t *s8 = (uint64_t *)s;
    uint64_t *s8_end = (uint64_t *)(s+(len&~7));

    while (s8 < s8_end && !haszero(*s8 ^ sym8))
        s8++;

    // Precise identification
    char *s_end = s + len;
    s = (char *)s8;
    while (s < s_end && *s != sym) {
        s++;
    }

    return s < s_end ? s : NULL;
}
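The cast from char * to uint64_t * above assumes the platform tolerates unaligned and type-punned loads. A more conservative sketch of the same idea, under a hypothetical name, loads each word with memcpy instead; compilers generally turn a fixed 8-byte memcpy into a single load. This is an illustration only, not code from the PR.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Non-zero if any byte of the 64-bit value x is zero. */
#define HASZERO64(x) \
    (((x) - 0x0101010101010101ULL) & ~(x) & 0x8080808080808080ULL)

/* Hypothetical variant of memchr8: same algorithm, but words are read
 * with memcpy so alignment and aliasing are not a concern. */
static const char *memchr8_memcpy(const char *s, char sym, size_t len) {
    /* Replicate sym into all 8 bytes; go via unsigned char so that a
     * byte >= 0x80 is not sign-extended. */
    const uint64_t sym8 = (uint64_t)(unsigned char)sym * 0x0101010101010101ULL;
    const char *s_end = s + len;
    const char *w_end = s + (len & ~(size_t)7);

    /* Skip whole 8-byte words containing no byte equal to sym. */
    while (s < w_end) {
        uint64_t w;
        memcpy(&w, s, sizeof w);
        if (HASZERO64(w ^ sym8))
            break;
        s += 8;
    }

    /* Precise identification within the matching word or the tail. */
    while (s < s_end && *s != sym)
        s++;

    return s < s_end ? s : NULL;
}
```

A simple way to gain confidence in a routine like this is to compare its result against memchr over buffers with the target byte at varying offsets and lengths.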

jkbonfield · 2024-09-24T16:30:25Z

Some benchmarks using samtools reset, which makes heavy use of the aux tag iterator.

My input data is an Ultima Genomics BAM, which has some long tags in there.

perf record samtools reset -O bam,level=0 -o /dev/null ~/lustre/qual_train/data/Ultima/HG002.10M-20M.bam
perf report -n | egrep 'skip_aux|aux_next'

Summed skip_aux and bam_aux_next perf counters for clang16: before=1697, now=473.
With gcc13: before=1342, now=218.

That seemed a bit extreme, so I wondered whether optimisation was skewing the numbers, with inlining moving the cost elsewhere, but building with -fno-inline backs up the huge speed difference. This input data typically has one Z tag of around 300 bytes.

Tested on some Novaseq alignments with no long string tags, the figures were around 170 (old) and 130 (new): a much smaller change, but still a benefit.

vasudeva8 · 2024-09-27T15:10:39Z

We may want to apply the same change in samtools view, which has a copy of the same function.

daviesrob assigned vasudeva8 Sep 26, 2024

jkbonfield force-pushed the skip_aux_memchr branch from 3b71dbc to 3b486e5 Compare September 26, 2024 10:11

vasudeva8 merged commit 2ff207b into samtools:develop Sep 27, 2024
9 checks passed
