
Q: pod5 subset <..> --threads takes a lot of resources #93

Open

sklages opened this issue Dec 1, 2023 · 2 comments

sklages commented Dec 1, 2023

I just started using pod5 subset to regroup my POD5 data per-channel to optimize dorado duplex basecalling speed.

The default setting for --threads is 4; I was brave and set it to 8 .. well, ..

System: Linux, 128 cores, 1TB RAM

The load before starting pod5 was around 34.

Shortly after starting something like:

    pod5 subset \
        --threads 8 \
        --force-overwrite \
        --recursive \
        --summary $rc_file \
        --columns channel \
        --output $POD5_TMPDIR \
        $RAW_DATA_DIR

with a small (7 GB) P2 dataset, the system load went over 200, making the system a bit sluggish. Heavy I/O, I guess.
top showed 8 python processes, each at 700 to 1200% CPU and in different states (R, S, D):

top - 21:43:28 up 274 days, 10:38, 29 users,  load average: 214.81, 106.81, 77.11
<...>
   PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
 19971 USER      20   0   22.2g 349288  51752 R  1210   0.0  36:49.68 python3
 19972 USER      20   0   21.6g 288420  51156 S  1041   0.0  37:35.33 python3
 19967 USER      20   0   21.7g 377372  50984 D 989.2   0.0  36:09.18 python3
 19969 USER      20   0   21.7g 277636  48848 R 972.3   0.0  35:51.98 python3
 19974 USER      20   0   21.7g 380280  54068 S 949.0   0.0  38:09.10 python3
 19973 USER      20   0   21.7g 362636  48292 R 943.6   0.0  35:44.70 python3
 19968 USER      20   0   21.6g 286052  48144 S 875.5   0.0  37:15.61 python3
 19970 USER      20   0   21.7g 270932  47140 D 843.0   0.0  36:15.94 python3

This is far more than I would expect when using --threads 8.

Is there a smooth way to better control the CPUs used by pod5 subset, and thus the I/O throughput (avoiding something like taskset)?
This small dataset took ~7 minutes to finish; the larger datasets are 100x to 200x that size. So I wonder what is considered "best practice"? On shared servers this is quite a big issue.
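One workaround might be to cap resources externally rather than relying on --threads alone; a minimal sketch, assuming systemd is available on the host (the CPU quota and I/O class are arbitrary example values, not recommendations):

    # Sketch: run pod5 subset in a transient cgroup scope with a CPU quota
    # (800% ~ 8 cores' worth of CPU time) and idle I/O priority.
    # All values here are illustrative only.
    systemd-run --user --scope -p CPUQuota=800% \
        ionice -c 3 \
        pod5 subset \
            --threads 8 \
            --force-overwrite \
            --recursive \
            --summary "$rc_file" \
            --columns channel \
            --output "$POD5_TMPDIR" \
            "$RAW_DATA_DIR"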

Instead of writing a few thousand per-channel POD5 files, wouldn't it be more convenient to write one, or a few, larger channel-sorted POD5 files?
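As a stopgap, the per-channel outputs could presumably be recombined into a handful of larger files with pod5 merge; a rough sketch (the channel-*.pod5 naming and the batch size of 500 files per merge are assumptions, not necessarily what subset produces):

    # Sketch: merge the per-channel outputs in batches of 500 files each,
    # so each merged file still keeps reads grouped contiguously by channel.
    # File naming and batch size are assumptions.
    cd "$POD5_TMPDIR"
    ls channel-*.pod5 | sort | split -l 500 - batch_
    for b in batch_*; do
        pod5 merge --output "merged_${b}.pod5" $(cat "$b")
    done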

Any ideas/comments/remarks are welcome :-)

sklages (Author) commented Dec 2, 2023

Now using a large P2 dataset, but leaving --threads at the default (4).

This runs out of memory on the same machine with 1 TB RAM (~550 GB free); reformatted for better readability:

### [2023-12-02 07:27:45] START: Merging POD5 files by channel..
Parsed 51501778 targets
memory allocation of 2883584000 bytes failed

(core dumped) pod5 subset 
  --threads 4 
  --force-overwrite 
  --recursive 
  --summary $rc_file 
  --columns channel 
  --output $POD5_TMPDIR 
  $RAW_DATA_DIR

RAW_DATA_DIR and POD5_TMPDIR are on different storage.

$rc_file is 2 GB in size:

$ head pod5_summary_per-channel.tsv
read_id channel
7636a648-a348-4b3a-9925-39587c4dfbbd    2727
0355921b-3da6-45a7-af50-45b9a7e1d93f    1940
bf818b42-7108-4f63-9115-0df5a66c9db6    1572
<...>
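(For reference, a summary like this can be produced with pod5 view, roughly as below; the exact options reflect my understanding of its interface, and paths are placeholders.)

    # Sketch: build a two-column (read_id, channel) summary TSV for use
    # with --summary / --columns channel from the raw POD5 directory.
    pod5 view "$RAW_DATA_DIR" \
        --include "read_id, channel" \
        --output pod5_summary_per-channel.tsv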

$ pod5 -v
Pod5 version: 0.3.2

Is there something very obvious that I am missing here?
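One possible way around the allocation failure might be to subset in chunks of channels instead of feeding all ~51.5 M targets at once; a rough, untested sketch (chunk size and file names are arbitrary assumptions):

    # Sketch: split the summary into chunks of ~500 channels each (so no
    # channel is split across runs), then run pod5 subset once per chunk
    # to keep the number of parsed targets per run lower.
    awk 'NR == 1 { hdr = $0; next }
         {
             f = "summary_chunk_" int(($2 - 1) / 500) ".tsv"
             if (!(f in seen)) { print hdr > f; seen[f] = 1 }
             print > f
         }' pod5_summary_per-channel.tsv

    for s in summary_chunk_*.tsv; do
        pod5 subset --threads 4 --recursive --force-overwrite \
            --summary "$s" --columns channel \
            --output "$POD5_TMPDIR" "$RAW_DATA_DIR"
    done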

HalfPhoton (Collaborator) commented:

Hi @sklages, we are in the process of updating subset to use significantly fewer resources.

Thanks for raising this issue. We'll let you know when we push these changes up.
