
RP2040 multicore support with example #17

Open
wants to merge 12 commits into main from rp2040-multicore-with-example

Conversation

@KoviRobi (Contributor) commented Dec 28, 2023

Hi,

I hope you had/are having happy holidays if you celebrate them! I've been having lots of fun learning more about, and hacking on, Send/Sync, async, and lilos -- thank you for your blog posts and software :)

This does several things, to get something like https://github.com/rp-rs/rp-hal/blob/main/rp2040-hal/examples/multicore_fifo_blink.rs working.

I'm not an expert on this, so it is possible there are better ways to do things -- it would make sense to first read through the description and check that my logic makes sense before reviewing the actual code.

  • First, I've done cargo fmt commits so they can be easily dropped if you want; my editor automatically formats on save, so it was easiest to do this upfront. I can drop those now, or you can ignore them by looking at a diff that excludes them.
  • Use https://docs.rs/portable-atomic/latest/portable_atomic/, because those atomics can make use of the RP2040 hardware spinlocks. I've compared the resulting binaries (see the commit for more details on the horrible sed expression that makes the diff nicer): the STM32F4 and STM32H7 examples produce identical-looking binaries, while the RP2040/minimal example skips some DMB SY instructions, presumably because of the single-core assumption.
  • Adds a multicore example, similar to https://github.com/rp-rs/rp-hal/blob/main/rp2040-hal/examples/multicore_fifo_blink.rs (and I think it shows off the advantage of using async/await, as the code is a lot simpler due to the lack of an explicit state machine).
  • Adds a 2core feature to lilos, see below

lilos core changes/design considerations

  • This is called 2core because some things are now arrays initialized to [_; 2]; it should just be a matter of changing the 2 to a different number if there are 4-core microcontrollers etc. I don't imagine there are so many different configurations that it will be a problem. I couldn't find a reasonable way to have Rust/Cargo configure an integer value instead of the boolean on/off of feature flags.
  • Currently there is a single WAKE_BITS, which is fine given we are using multi-core spinlocks (a rough sketch of this idea follows this list). I did consider having a WAKE_BITS per core, but that meant a Notify had to be subscribed to only on that core for the mask to make sense, which seemed less nice.
  • Per-core separate timer lists: if we are using systick we cannot be sure the timers are aligned across cores. We could use timerawl/timerawh on the RP2040 if we wanted a single source of time, but we would then have to lock/unlock the linked list.
  • Changed the debug TASK_FUTURES to be per-core, mostly because I don't think we could append to it zero-cost in a nice way. I have made matching changes to lilosdbg (as well as portable_atomic support, and some other small changes).
  • Currently the Send + 'static restriction indicating multi-core comes from rp2040_hal::multicore::Core::spawn. I think this is correct, as that is what creates the new thread, not lilos::exec::run_tasks.
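
A rough sketch of the shared-WAKE_BITS idea, to make the above concrete. This is illustrative only: the names claim_wake_bits/wake_task are mine, not the PR's actual API, and it assumes portable-atomic is configured with a critical-section implementation (e.g. the spinlock-based one from rp2040-hal) so the read-modify-write operations are safe across cores:

```rust
use portable_atomic::{AtomicUsize, Ordering};

/// One wake-bit word shared by both cores.
static WAKE_BITS: AtomicUsize = AtomicUsize::new(0);

/// Next unclaimed bit; each core's scheduler claims a contiguous run of
/// bits for its tasks as it starts up.
static NEXT_BIT: AtomicUsize = AtomicUsize::new(0);

/// Claim `n` wake bits for the calling core's tasks; returns the base index.
fn claim_wake_bits(n: usize) -> usize {
    let base = NEXT_BIT.fetch_add(n, Ordering::Relaxed);
    assert!(base + n <= usize::BITS as usize, "out of wake bits");
    base
}

/// Set a task's wake bit; safe to call from either core because the
/// read-modify-write goes through the shared critical section.
fn wake_task(bit_index: usize) {
    WAKE_BITS.fetch_or(1 << bit_index, Ordering::SeqCst);
}
```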

Timer related considerations

Sorry, I've ended up making a fair number of changes; here is the summary:

  • Added a timer feature (on by default), which turns any timer support on or off (previously systick did this)
  • The systick feature specifically turns on an implementation of the timer that uses SysTick (this is still on by default)
  • Added a Timer trait, for which there is a SysTickTimer implementation if systick is on (a rough sketch of its shape follows this list)
  • This way half the code in time.rs (sleep_until, sleep_for, with_deadline, with_timeout, PeriodicGate) is reusable if you implement custom timers.
  • This has resulted in some slight API changes, notably that you need to pass the SysTickTimer to run_* and to sleep_*/etc. I could make this backwards compatible with #[cfg(...)] macros, but it ends up looking a bit ugly.
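
To give a feel for the split (a hypothetical sketch only -- the actual trait methods and signatures in this PR may well differ), the idea is a small Timer trait that the reusable half of time.rs is written against, with SysTickTimer as the implementation provided by the systick feature:

```rust
use core::time::Duration;

/// Hypothetical shape of the trait the generic sleep/timeout code targets.
pub trait Timer {
    /// Monotonic time since boot, in this timer's ticks.
    fn now(&self) -> u64;
    /// Length of one tick, so durations can be converted to tick counts.
    fn tick_period(&self) -> Duration;
}

/// Provided when the `systick` feature is on.
pub struct SysTickTimer;

impl Timer for SysTickTimer {
    fn now(&self) -> u64 {
        // Would read the millisecond counter maintained by the SysTick ISR;
        // stubbed out because this is only a sketch.
        0
    }
    fn tick_period(&self) -> Duration {
        Duration::from_millis(1)
    }
}

// Call sites then take the timer explicitly, e.g. (names illustrative):
//   run_tasks(&mut futures, ALL_TASKS, &SysTickTimer)
//   sleep_for(&SysTickTimer, Millis(100)).await
```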

After you have had a brief look and decided there are no more big changes, there is still some tidying for me to do: rebuilding the STM32* examples and making sure they work (for simplicity I've currently mostly only done the RP2040 ones). That is mostly just changing them to use the newly named SysTickTimer.

@KoviRobi (Contributor, Author) commented Dec 28, 2023

Oh hang on, I realised timers work differently than I thought (manual millisecond timer), I'll fix that soon.

(The timers are being incremented N times, where N is the number of lilos::exec::run_tasks running, whoops)

@KoviRobi (Contributor, Author) commented Dec 30, 2023

Since I think it's in a good state now, I'll drop the first couple of cargo fmt commits (this comment is partly for me, so I know when I did that in case I want to go back).

This should reduce the overall diff.

@KoviRobi force-pushed the rp2040-multicore-with-example branch 2 times, most recently from 3b8dc72 to 64e3f8c on December 30, 2023 at 22:26
@KoviRobi force-pushed the rp2040-multicore-with-example branch from 64e3f8c to 8384665 on January 20, 2024 at 20:09
@cbiffle (Owner) commented Mar 2, 2024

Hi! I'm going through the outstanding PRs (whoops!) and this is really neat. I need to take some time to sit and think about the design implications. You've done several possibly-orthogonal things here, each of which is something I've thought about doing (but haven't):

  • Generalizing the time API to support custom timers (I needed this on nRF52, where you can't use systick with the low-power modes because it stops)
  • Using portable_atomics (last I looked it was missing features I needed, but I know they were planning on adding them, so that may have changed)

And then of course the SMP part. (I'm actually using lilos with SMP on rp2040 already, but not running the lilos scheduler on the second core -- just using it to do math. Your approach is more general and flexible.)

Mostly I need to give some thought to the handling of notification masks and wake bits across cores. I hope to have some time to read your changes in detail over the next few days and get back to you.

Thanks for looking into this!

@KoviRobi (Contributor, Author) commented Mar 2, 2024

I actually needed the two cores because I am planning on using one of the cores to debug the other one! And I still need USB for the other core's stuff (I'm making a keyboard), so I need to get data from one core to the other. (Very much WIP, I need to fix some stuff with it, maybe even probe-rs/openocd, as they don't quite work in different ways: https://github.com/KoviRobi/rp2040-selfdebug)

I have an example that does profiling (well, just measuring the task switch time) using SysTick. It turns out the multicore HW spin lock isn't very fast; for single core you want portable atomics with "unsafe assume single core".
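
For reference, here is a minimal sketch of the kind of SysTick-based cycle counting I mean (illustrative only; my actual profiling code, linked in a later comment, is hackier). It assumes the cortex-m crate and ignores counter wrap-around:

```rust
use cortex_m::peripheral::{syst::SystClkSource, SYST};

/// Time a closure in core-clock cycles using the 24-bit SysTick down-counter
/// (the Cortex-M0+ has no DWT cycle counter). Only valid for short spans.
fn measure_cycles<F: FnOnce()>(syst: &mut SYST, f: F) -> u32 {
    syst.set_clock_source(SystClkSource::Core);
    syst.set_reload(0x00ff_ffff);
    syst.clear_current();
    syst.enable_counter();

    let start = SYST::get_current();
    f();
    let end = SYST::get_current();

    // SysTick counts down, so elapsed cycles = start - end. The first
    // measurement tends to include cache/flash effects, as seen below.
    start.wrapping_sub(end) & 0x00ff_ffff
}
```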

Not that I have speed issues, I'm just coding for coding's sake

Speaking of, I have implemented the bitmask approach for the task wake loop, haven't profiled it yet.

I'll try to rebase this and probably push those on a different PR, if you want. Currently a different project has my hobby attention though.

(A bit rambling because I'm on a train on my phone and it's just hard to edit/read the whole post, sorry)

@cbiffle (Owner) commented Mar 2, 2024

> turns out the multicore HW spin lock isn't very fast

That's surprising, if it's using the hardware feature I'm thinking of ... I'd be interested in looking at the generated code. I would expect it to be on the order of four cycles, but admittedly the Raspberry Pi folks don't specifically describe any wait states etc., so there could be a penalty lurking.

I took a look at your handling of WAKE_BITS and I would not have considered your approach! Basically claiming bits starting from the LSB as each core's scheduler is fired up. Given that I usually have a lot fewer than 16 tasks per core in my applications, that seems considerably simpler than what I was expecting to have to do, which is a WAKE_BITS array. So, that's neat.

I'm kind of inclined to convert all the per-core statics into arrays, and just have arrays-of-one in the uniprocessor case. That would simplify your diff, I think -- on a uniprocessor machine cpu_core_id could be hardcoded to return 0, but accesses to things like the timer list could always assume an array.
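
Something like this, as a rough sketch (the names and the feature gate are only illustrative, not a concrete proposal):

```rust
#[cfg(feature = "2core")]
pub const NUM_CORES: usize = 2;
#[cfg(not(feature = "2core"))]
pub const NUM_CORES: usize = 1;

/// Index of the core we're running on.
#[cfg(feature = "2core")]
#[inline(always)]
fn cpu_core_id() -> usize {
    // RP2040: SIO.CPUID at 0xd000_0000 reads as 0 or 1 depending on the core.
    unsafe { core::ptr::read_volatile(0xd000_0000 as *const u32) as usize }
}

#[cfg(not(feature = "2core"))]
#[inline(always)]
fn cpu_core_id() -> usize {
    0 // hardcoded on uniprocessor builds
}

// Per-core statics are then always arrays, accessed uniformly as e.g.
// TIMER_LISTS[cpu_core_id()] -- an array of one in the uniprocessor case.
```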

Aside:

It would sure be nice if there were more symmetric multiprocessor microcontrollers. The RP2040 is the only common one that I'm aware of, and I'm concerned that we might produce a design that is too RP2040 specific. All the other multicore microcontrollers I've used are asymmetric, meaning you can't really run parts of the same program on both cores because they have different instruction sets. (In that case you want two separate copies of lilos instead of SMP.)

@cbiffle (Owner) commented Mar 6, 2024

Okay, I've had some more time to read through this and think about it, and there are definitely chunks of this I'm interested in trying to merge -- and some other chunks of it that I'm still thinking carefully about.

Are you comfortable with me extracting portions of this into smaller commits as long as I credit you in the commit description? I don't want to steal your thunder here, but I also don't want to commit things I wrote under your name. :-)

@KoviRobi (Contributor, Author) commented Mar 6, 2024

Of course, feel free to split and modify – my git fu is also reasonably advanced, so I'm happy to split things up too, but I'm currently just busy with other stuff :)

@KoviRobi (Contributor, Author) commented Mar 6, 2024

Also, this doesn't yet include support for mutexes being multicore (I have a draft branch somewhere); I'm just using the multicore FIFO. And as I was thinking about this code recently and revisiting old decisions, I realised that if I'm using the multi-core FIFO then I probably don't actually need the multicore lilos: I could just run two separate instances of lilos (though that might mean some code duplication). Or at least, if the FIFO is the only synchronisation between the cores, then I think I could go back to using unsafe-assume-single-core (I need to think about this more carefully).

I haven't yet played around with https://github.com/rtic-rs/microamp but I think that could be interesting for the asymmetric multi processing case.

@KoviRobi (Contributor, Author) commented Mar 7, 2024

OK, I have pushed https://github.com/KoviRobi/lilos/blob/kovirobi/task-switch-profiling/examples/rp2040/profiling/src/main.rs, which is a small hacky thing I was using to look at some timings. It has bugs, and I haven't spent much more time on it since writing it a while ago, but (attach the RP2040 via USB, enter bootloader mode):

$ cargo r -r -F pa-1core
$ picocom -q /dev/serial/by-id/usb-KoviRobi_Lilos_profile_1234ABCD-if00
Can time: + * i(sb) n(oop) a(tomic load)         # pressed `h` for help, valid keys: `+`, `*`, `n`, `a`, `p`, `b`
p for profile task switch, print some times
b for reset to USB bootloader

atomic load 0 took 161                           # pressed `a` for timing atomic load
atomic load 0 took 9                             # first measurement always wrong
atomic load 0 took 9                             # second measurement is reliable

isb took 55                                      # to get an idea of scale, ISB is 3 instructions, plus 2 cycles for load
isb took 5                                       # actual value (first measurement wrong)

FATAL: read zero bytes from port                 # press `b` to enter bootloader
term_exitfunc: reset failed for dev              # closes UART

And when using critical-section for the portable atomics:

$ cargo r -r -F pa-crit
$ picocom -q /dev/serial/by-id/usb-KoviRobi_Lilos_profile_1234ABCD-if00
atomic load 0 took 339
atomic load 0 took 85
atomic load 0 took 85

BTW, the shortest time the task-switch measurement task takes is 315 cycles when using pa-1core and the commit "Exec: Use bitops to avoid checking all tasks", 398 cycles without that commit (I added a couple of never-executed tasks to make that number larger), and 673 cycles with the bitops commit plus critical-section. (The p option in the command line is there to measure this.)

The whole thing could have been written a lot simpler/neater if I had just used defmt-rtt, for example, but I was on my laptop, which only has one remaining working USB port... That file was also me experimenting with copy-to-RAM for the RP2040 using Rust, but it also compiles with -F flash to something on which defmt-rtt works (I haven't looked into it, but copy-to-RAM doesn't play well with cargo run when using probe-run, almost certainly because the code goes into RAM rather than flash).

(P.S.: If I'm doing something silly feel free to just tell me)

@cbiffle (Owner) commented Mar 7, 2024

> (P.S.: If I'm doing something silly feel free to just tell me)

Don't worry, I would! Nothing silly, this is just a large and subtle change, and I'm having to think about it in small chunks between other obligations. Sorry about the delays. I appreciate the contribution.

@cbiffle (Owner) commented Mar 7, 2024

In terms of the profiling, I'd encourage you to look at the generated code by disassembling the binary. 85 cycles seems high for an atomic load, but the reason should be apparent in the generated code, and it may be possible to switch the features on critical-section to improve performance for your case. If you have access to ARM binutils I'd specifically suggest doing something like arm-none-eabi-objdump -dCl path/to/your/binary, which will get you demangled Rust symbols and will spam the output full of line numbers in the source code -- making it easier to track down a specific portion of your program that you're curious about. (The name of the ARM objdump may be different on your OS.)

For critical-section using cpsid and cpsie I'd expect probably 12 cycles or less for an atomic operation. It'd be about 6 for the naive version, or slightly more to save and restore the interrupt status (unnecessary work in this case but good practice for a general-purpose critical section library).
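
For comparison, the single-core flavour is roughly the usual PRIMASK save/disable/restore pattern. This is a sketch of that pattern using the cortex-m crate, not the critical-section crate's actual source:

```rust
use cortex_m::register::primask;

/// Enter: remember whether interrupts were enabled, then `cpsid i`.
fn acquire() -> bool {
    let was_active = primask::read().is_active();
    cortex_m::interrupt::disable();
    was_active
}

/// Exit: `cpsie i`, but only if interrupts were enabled on entry, so that
/// nested critical sections restore the right state.
fn release(was_active: bool) {
    if was_active {
        unsafe { cortex_m::interrupt::enable() };
    }
}
```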

Your profiling fixture is neat. I've built similar things when reverse engineering cycle counts in the past. It's interesting that the first test is always wrong; if you're running out of the QSPI flash on the RP2040, that could explain it, since it'll be filling a cache each time you do something unexpected.

@KoviRobi (Contributor, Author) commented Mar 13, 2024

> Don't worry, I would! Nothing silly, this is just a large and subtle change, and I'm having to think about it in small chunks between other obligations. Sorry about the delays. I appreciate the contribution.

Thank you, and no worries about the delays. I'm just fairly new to this, having only been out of uni for the past 4 years -- and playing with Rust only in the last 1.5 years.

> arm-none-eabi-objdump -dCl path/to/your/binary, which will get you demangled Rust symbols and will spam the output full of line numbers in the source code

That's cool, I had only come across cargo-binutils, which uses llvm-objdump. I have now also managed to get radare2/iaito working, so I have a graph, which for me is nicer to look at -- I needed to set the CPU to cortex.

> For critical-section using cpsid and cpsie I'd expect probably 12 cycles or less for an atomic operation. It'd be about 6 for the naive version, or slightly more to save and restore the interrupt status (unnecessary work in this case but good practice for a general-purpose critical section library).

This is the multi-core critical section using the spinlocks (https://github.com/rp-rs/rp-hal/blob/v0.9.2/rp2040-hal/src/critical_section_impl.rs); the cpsid/cpsie version is the unsafe-assume-single-core one, I think, which is 7 cycles (profiling says 9, but 2 of those are loading the systick/cycle count).

I have attached my annotated decompile of the critical-section acquire using spinlocks (sym__critical_section_1_0_acquire), which is the one that takes 83 cycles. It is re-entrant at the level of the core (whereas the raw spinlocks aren't), and that's one reason for the slowness.

> It's interesting that the first test is always wrong; if you're running out of the QSPI flash on the RP2040, that could explain it, since it'll be filling a cache each time you do something unexpected.

I was thinking about this, but it happens even if I do the copy-to-RAM version -- I need to double-check it is doing what I expect, but I think there should be no XIP cache delays in the copy-to-RAM version? Maybe I need to do some more setup to disable the cache, and it is currently set up wrong.

@KoviRobi force-pushed the rp2040-multicore-with-example branch from 8384665 to 80983c0 on March 17, 2024 at 01:50
@KoviRobi (Contributor, Author) commented Mar 17, 2024

I have rebased onto main@{u} and reordered the commits so that the first 5 are the non-multicore-specific ones that you might want to pick. This is to help you; if you have already rebased/picked the commits you wanted, then feel free to use those, that's fine by me. But if you haven't, hopefully this will save you some time.

Though I see now that there are tests, and they are failing -- I will fix that. I was also wondering whether I could add a QEMU-hosted LM3S target for the tests, so that they can be run without physical hardware (no idea if the CI has access to hardware).
Edit: Fixed the original test failures, and rebased on top of #19 "Add QEMU to test suite", which is currently failing because of 3816321; see #19 for more details.

I was going to play around with uAMP and see how that ends up looking. Given the spin-lock is doubling the lilos task-switch time, I don't know if the current multi-core example is that useful, and I believe the FIFO should be safe with two separate single-core portable-atomic lilos instances, so it would probably be more useful to have a uAMP example with the FIFO driver.

Also improve portability by using `#!/usr/bin/env bash` instead of just
`#!/bin/bash` (though it could probably also be replaced with `#!/bin/sh`,
I just cannot remember well enough what is a bashism and what isn't).
E.g. for the RP2040 you need to, in the project you are using lilos in,
add:

        cargo add rp2040-hal -F critical-section-impl
        cargo add critical-section

If you compare the two binaries, you see that the difference is for the
rp2040/minimal example, where the lilos atomics have some `dmb sy`
instructions.

This is a reasonably good way to compare executables: run it on this
commit and on the prior commit, output to a file each time, and diff using
your favourite diff program. Yes, the regexp is a bit horrible.

        cargo objdump --release -- -d --no-show-raw-insn |                                       \
          sed                                                                                    \
            -e 's/^[0-9a-fA-F]\+: //'                                                            \
            -e 's/\(0x\)\?[0-9a-fA-F]\+ <\(\S\+\)::h[0-9a-fA-F]\{16\}\(+0x[0-9a-fA-F]\+\)>/\2\3/' \
            -e 's/\(0x\)\?[0-9a-fA-F]\+ <\(\S\+\)::h[0-9a-fA-F]\{16\}>/\2/'                       \
            -e 's/\(0x\)\?[0-9a-fA-F]\+ <\(\S\+\)\(+0x[0-9a-fA-F]\+\)>/\2\3/'                     \
            -e 's/\(0x\)\?[0-9a-fA-F]\+ <\(\S\+\)>/\2/'                                           \
            -e 's/@ imm = #-\?0x[0-9a-fA-F]\+//'
Useful for custom timers, which can be implemented along the lines of:

```rust
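// Not self-contained: assumes `Pin`, the lilos timer `List`, an `Instant`
// time type, a `tick()` "now" helper, and `SystClkSource` are in scope in
// the surrounding crate.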
fn make_idle_task<'a>(
    core: &'a mut cortex_m::Peripherals,
    timer_list: Pin<&'a List<Instant>>,
    cycles_per_us: u32,
) -> impl FnMut() + 'a {
    // Make it so that `wfe` waits for masked interrupts as well as events --
    // the problem is that the idle-task is called with interrupts disabled (to
    // not have an interrupt fire before we call the idle task but after we
    // check that we should sleep -- for `wfi` it would just wake up).
    // See
    // https://www.embedded.com/the-definitive-guide-to-arm-cortex-m0-m0-wake-up-operation/
    const SEVONPEND: u32 = 1 << 4;
    unsafe {
        core.SCB.scr.modify(|scr| scr | SEVONPEND);
    }

    // 24-bit timer
    let max_sleep_us = ((1 << 24) - 1) / cycles_per_us;
    core.SYST.set_clock_source(SystClkSource::Core);

    move || {
        match timer_list.peek() {
            Some(wake_at) => {
                let now = tick();
                if wake_at > now {
                    let wake_in_us = u64::min(
                        max_sleep_us as u64,
                        (wake_at - now).to_micros(),
                    );
                    let wake_in_ticks = wake_in_us as u32 * cycles_per_us;
                    // Setting zero to the reload register disables systick --
                    // systick is non-zero due to `wake_at > now`
                    core.SYST.set_reload(wake_in_ticks);
                    core.SYST.clear_current();
                    core.SYST.enable_interrupt();
                    core.SYST.enable_counter();
                    // We use `SEV` to signal from the other core that we can
                    // send more data. See also the comment above on SEVONPEND
                    cortex_m::asm::wfe();
                } else {
                    // We just missed a timer, don't idle
                }
            }
            None => {
                // We use `SEV` to signal from the other core that we can send
                // more data. See also the comment above on SEVONPEND
                cortex_m::asm::wfe();
            }
        }
    }
}
```
- Uses a custom timer
- Uses the same WAKE_BITS bitmask, given we are already using
  critical-sections; probably not too bad with 2 cores, might degrade
  with more cores
@KoviRobi force-pushed the rp2040-multicore-with-example branch from 80983c0 to bec9219 on March 17, 2024 at 18:26
@cbiffle (Owner) commented Apr 23, 2024

Ooh, neat, I like radare2. I will look more closely at your disassembly soon.

I think I'm going to try to declare a 1.0.0 release for the main system without the SMP changes, but I would like to see about incorporating SMP support sometime after that.

@KoviRobi (Contributor, Author) commented:

That's totally fair; I still don't think the changes are ready in many ways, and I need to get back to, e.g., looking at microAMP -- I had a brief look, but it seemed like RTIC has focused away from that for now and onto getting the core stuff working.

Also, in the wake of the xz-utils incident, it does occur to me that we all need to be better towards open-source maintainers, so I'll just be upfront that if this does land, I'm not sure I will have the capacity to help maintain it in the longer term. I seem to be more of a code butterfly, flitting from codebase to codebase. [Currently finding Tiny Tapeout an interesting thing to focus on.]

I mean, I think lilos is awesome, I love how small it is, and I would love to understand it more (your article about async debugging, and lilosdbg, is amazing, by the way -- I think I'd still need to play around with it more to understand how stackless async's memory usage scales, and how the stack-based linked lists stay around, and to compare it to a pre-emptive OS that needs multiple stacks and register stashing).

[And yes, I do plan to take a look at Hubris too! I've done some Erlang/Elixir in the past; I haven't tried Hubris yet, but the bits I've briefly read about seem interesting.]

But, as a different example, I've ended up being the maintainer of a small number of packages for NixOS because I wanted to play around with those packages, so I packaged them up -- perhaps I should just have packaged them up privately, to avoid it being a maintenance burden for the core NixOS people (who click approve and check the package details -- but the automated updates by ofborg are pretty nifty). [And I do still use NixOS and think the core ideas are awesome, even if I have run into lots of bits that are not as polished.]

So I just wanted to be honest about this, in case that changes things. If I do keep using lilos for my keyboards, that makes it more likely that I'll be around to help, but it's probably better if I don't promise and then not be around to help.

Also, congratulations on lilos v1.0! I'll buy you a beer/coffee/tea/etc if you have a donate link anywhere (and want to, of course). Lilos and your blog posts have been very educational :)

@adfernandes commented:

> It would sure be nice if there were more symmetric multiprocessor microcontrollers. The RP2040 is the only common one that I'm aware of, and I'm concerned that we might produce a design that is too RP2040 specific. All the other multicore microcontrollers I've used are asymmetric, meaning you can't really run parts of the same program on both cores because they have different instruction sets. (In that case you want two separate copies of lilos instead of SMP.)

@cbiffle I just found the MCX N94/N54 MCUs that are almost completely symmetric Cortex-M33s running at the same clock speed, with a cheap developer board that has SWD access.

  • Primary CPU:
    • Cortex-M33
    • TrustZone, MPU, FPU, SIMD, ETM, CTI
  • Secondary CPU:
    • Cortex-M33
    • Barebone

I just bought two, but haven't had time to play with them yet...

@cbiffle (Owner) commented May 11, 2024

> @cbiffle I just found the MCX N94/N54 MCUs that are almost completely symmetric Cortex-M33s running at the same clock speed, with a cheap developer board that has SWD access.

@adfernandes Neat. At first glance those look very similar to the LPC55S69 we use at my dayjob, which is also a dual M33 where one of the two M33s is missing a bunch of useful features. Fortunately lilos doesn't rely on having an MPU!

It would be interesting to see if it has the same bootloader vulnerabilities. But that's another topic. ;-)

@adfernandes commented:

> those look very similar to the LPC55S69 we use at my dayjob

Oh, very aware, and I have been following all the public issues... I do secure boot and fuse-blowing at my dayjob for AArch32 and AArch64, so I have an interest in following this stuff among silicon vendors.

And I know you know about the Nordic pure-ARM chips...

... but if you're interested in really heterogeneous multiprocessing, Nordic just put out their docs for the nrf54h20, the one with the combined ARM and RISC-V cores!

@cbiffle force-pushed the main branch 2 times, most recently from 7e11fe9 to d49a601 on May 26, 2024 at 01:06