RP2040 multicore support with example #17
Oh hang on, I realised timers work differently than I thought (manual millisecond timer), I'll fix that soon. (The timers are being incremented N times, where N is the number of cores.)
Since I think it's in a good state now, I'll pop the first couple of commits. This should reduce the overall diff.
Hi! I'm going through the outstanding PRs (whoops!) and this is really neat. I need to take some time to sit and think about the design implications. You've done several possibly-orthogonal things here, each of which is something I've thought about doing (but haven't):
And then of course the SMP part. (I'm actually using lilos with SMP on rp2040 already, but not running the lilos scheduler on the second core -- just using it to do math. Your approach is more general and flexible.) Mostly I need to give some thought to the handling of notification masks and wake bits across cores. I hope to have some time to read your changes in detail over the next few days and get back to you. Thanks for looking into this!
I actually needed the two cores because I am planning on using one of the cores to debug the other one! And I still need USB for the other core's stuff (making a keyboard), so I need to get data from one core to the other. (Very much WIP, I need to fix some stuff with it, maybe even probe-rs/openocd as they don't quite work, in different ways: https://github.com/KoviRobi/rp2040-selfdebug)

I have an example that does profiling (well, just measuring the task switch time) using SysTick. It turns out the multicore HW spin lock isn't very fast; for single core you want portable atomics with "unsafe assume single core". Not that I have speed issues, I'm just coding for coding's sake.

Speaking of, I have implemented the bitmask approach for the task wake loop, haven't profiled it yet. I'll try to rebase this and push those on a different MR probably, if you want. Currently a different project has my hobby attention though.

(A bit rambling because I'm on a train on my phone and it's just hard to edit/read the whole post, sorry)
That's surprising, if it's using the hardware feature I'm thinking of ... I'd be interested in looking at the generated code. I would expect it to be on the order of four cycles, but admittedly the Raspberry Pi folks don't specifically describe any wait states etc. So there could be a penalty lurking.

I took a look at your handling of … I'm kind of inclined to convert all the per-core …

Aside: It would sure be nice if there were more symmetric multiprocessor microcontrollers. The RP2040 is the only common one that I'm aware of, and I'm concerned that we might produce a design that is too RP2040 specific. All the other multicore microcontrollers I've used are asymmetric, meaning you can't really run parts of the same program on both cores because they have different instruction sets. (In that case you want two separate copies of …)
Okay, I've had some more time to read through this and think about it, and there are definitely chunks of this I'm interested in trying to merge -- and some other chunks of it that I'm still thinking carefully about. Are you comfortable with me extracting portions of this into smaller commits as long as I credit you in the commit description? I don't want to steal your thunder here, but I also don't want to commit things I wrote under your name. :-)
Of course, feel free to split and modify – my git fu is also reasonably advanced so happy to split things up too, but currently just busy with other stuff :)
Also, this doesn't yet include support for mutexes being multicore (I have a draft branch somewhere), I'm just using the multicore FIFO. And as I was thinking about this code recently/revisiting old decisions, I realised that if I'm using the multi-core FIFO then I probably don't actually need the multicore lilos: I could just run two separate instances of lilos (though that might mean some code duplication). Or at least, if the FIFO is the only synchronisation between cores, then I could go back to using …

I haven't yet played around with https://github.com/rtic-rs/microamp but I think that could be interesting for the asymmetric multi processing case.
Ok I have pushed https://github.com/KoviRobi/lilos/blob/kovirobi/task-switch-profiling/examples/rp2040/profiling/src/main.rs which is a small hacky thing I was using to look at some timings. It has bugs, and I haven't spent much more time on it since writing it a while ago, but (attach RP2040 via USB, enter bootloader mode):

```
$ cargo r -r -F pa-1core
$ picocom -q /dev/serial/by-id/usb-KoviRobi_Lilos_profile_1234ABCD-if00
Can time: + * i(sb) n(oop) a(tomic load)  # pressed `h` for help, valid keys: `+`, `*`, `n`, `a`, `p`, `b`
p for profile task switch, print some times
b for reset to USB bootloader
atomic load 0 took 161               # pressed `a` for timing atomic load
atomic load 0 took 9                 # first measurement always wrong
atomic load 0 took 9                 # second measurement is reliable
isb took 55                          # to get an idea of scale, ISB is 3 instructions, plus 2 cycles for load
isb took 5                           # actual value (first measurement wrong)
FATAL: read zero bytes from port     # press `b` to enter bootloader
term_exitfunc: reset failed for dev  # closes UART
```

And when using critical-sections for portable atomics:

```
$ cargo r -r -F pa-crit
$ picocom -q /dev/serial/by-id/usb-KoviRobi_Lilos_profile_1234ABCD-if00
atomic load 0 took 339
atomic load 0 took 85
atomic load 0 took 85
```

BTW the shortest time the task switching measure task takes is …

The whole thing could have been written a lot simpler/neater if I had just used defmt-rtt for example, but I was on my laptop which only has one remaining working USB port... And that file was also me experimenting with copy-to-ram for the RP2040 using Rust, but it also compiles using …

(P.S.: If I'm doing something silly feel free to just tell me)
Don't worry, I would! Nothing silly, this is just a large and subtle change, and I'm having to think about it in small chunks between other obligations. Sorry about the delays. I appreciate the contribution.
In terms of the profiling, I'd encourage you to look at the generated code by disassembling the binary. 85 cycles seems high for an atomic load, but the reason should be apparent in the generated code, and it may be possible to switch the features on …

For …

Your profiling fixture is neat. I've built similar things when reverse engineering cycle counts in the past. It's interesting that the first test is always wrong; if you're running out of the QSPI flash on the RP2040, that could explain it, since it'll be filling a cache each time you do something unexpected.
Thank you, no worries about the delays. I'm just fairly new to this, only having been out of uni for the past 4 years -- and playing with Rust only in the last 1.5 years.
That's cool, I had only come across cargo-binutils, which uses llvm-objdump. I have now also managed to get radare2/iaito working, so I have a graph, which for me is nicer to look at -- I needed to set the CPU to …
This is the multi-core critical section using the spinlocks (https://github.com/rp-rs/rp-hal/blob/v0.9.2/rp2040-hal/src/critical_section_impl.rs); the cpsid/cpsie version is the …

I have attached my annotated decompile of critical-section acquire using spinlocks, which is the one that takes 83 cycles. It is re-entrant at the level of the core (whereas the raw spinlocks aren't); that's one reason for the slowness.
I was thinking about this, but this happens even if I do the copy-to-ram version -- I need to double check it is doing what I expect it to do, but I think there should be no XIP cache delays in the copy-to-ram version? Maybe I need to do some more setting up to disable the cache, and it is currently set up wrong.
I have rebased onto …

I was going to play around with uAMP, see how that ends up looking. Given the spin-lock is doubling the lilos task-switch time, I don't know if the current multi-core example is that useful, and I believe the FIFO should be safe with two separate single-core portable-atomic lilos instances, so it would probably be more useful to have a uAMP example with the FIFO driver.
Also improve portability by using `#!/usr/bin/env bash` instead of just `#!/bin/bash` (though it probably could also be replaced with `#!/bin/sh`; I just cannot remember well enough what is a bashism and what isn't).
E.g. for the RP2040 you need to, in the project you are using lilos in, add:

```
cargo add rp2040-hal -F critical-section-impl
cargo add critical-section
```

If you compare the two binaries, you see that the difference is for the rp2040/minimal example, where the lilos atomics have some `dmb sy` instructions. This is a reasonably good way to compare executables: run it on this commit and the prior commit, output to a file, and diff using your favourite diff program. Yes, the regexp is a bit horrible.

```
cargo objdump --release -- -d --no-show-raw-insn | \
  sed \
    -e 's/^[0-9a-fA-F]\+: //' \
    -e 's/\(0x\)\?[0-9a-fA-F]\+ <\(\S\+\)::h[0-9a-fA-F]\{16\}\(+0x[0-9a-fA-F]\+\)>/\2\3/' \
    -e 's/\(0x\)\?[0-9a-fA-F]\+ <\(\S\+\)::h[0-9a-fA-F]\{16\}>/\2/' \
    -e 's/\(0x\)\?[0-9a-fA-F]\+ <\(\S\+\)\(+0x[0-9a-fA-F]\+\)>/\2\3/' \
    -e 's/\(0x\)\?[0-9a-fA-F]\+ <\(\S\+\)>/\2/' \
    -e 's/@ imm = #-\?0x[0-9a-fA-F]\+//'
```
Useful for custom timers, which can be implemented along the lines of:

```rust
fn make_idle_task<'a>(
    core: &'a mut cortex_m::Peripherals,
    timer_list: Pin<&'a List<Instant>>,
    cycles_per_us: u32,
) -> impl FnMut() + 'a {
    // Make it so that `wfe` waits for masked interrupts as well as events --
    // the problem is that the idle-task is called with interrupts disabled (to
    // not have an interrupt fire before we call the idle task but after we
    // check that we should sleep -- for `wfi` it would just wake up).
    // See
    // https://www.embedded.com/the-definitive-guide-to-arm-cortex-m0-m0-wake-up-operation/
    const SEVONPEND: u32 = 1 << 4;
    unsafe {
        core.SCB.scr.modify(|scr| scr | SEVONPEND);
    }
    // 24-bit timer
    let max_sleep_us = ((1 << 24) - 1) / cycles_per_us;
    core.SYST.set_clock_source(SystClkSource::Core);
    move || {
        match timer_list.peek() {
            Some(wake_at) => {
                let now = tick();
                if wake_at > now {
                    let wake_in_us = u64::min(
                        max_sleep_us as u64,
                        (wake_at - now).to_micros(),
                    );
                    let wake_in_ticks = wake_in_us as u32 * cycles_per_us;
                    // Setting zero to the reload register disables systick --
                    // systick is non-zero due to `wake_at > now`
                    core.SYST.set_reload(wake_in_ticks);
                    core.SYST.clear_current();
                    core.SYST.enable_interrupt();
                    core.SYST.enable_counter();
                    // We use `SEV` to signal from the other core that we can
                    // send more data. See also the comment above on SEVONPEND
                    cortex_m::asm::wfe();
                } else {
                    // We just missed a timer, don't idle
                }
            }
            None => {
                // We use `SEV` to signal from the other core that we can send
                // more data. See also the comment above on SEVONPEND
                cortex_m::asm::wfe();
            }
        }
    }
}
```
- Uses custom timer
- Uses the same WAKE_BITS bitmask; given we are already using critical-sections, probably not too bad with 2 cores, might degrade with more cores
Ooh, neat, I like radare2. I will look more closely at your disassembly soon. I think I'm going to try to declare a 1.0.0 release for the main system without the SMP changes, but I would like to see about incorporating SMP support sometime after that.
That's totally fair, I still don't think the changes are ready in many ways, and I need to get back to e.g. looking at microAMP -- I had a brief look but it seemed like RTIC had focused away from that for now, and on getting core stuff working.

Also, in the wake of the xz-utils backdoor, it does occur to me that we all need to be better towards open source maintainers, so I'll just be upfront that if this does land, I'm not sure I will have the capacity to help maintain it in the longer term. I seem to be more of a code butterfly, flitting from codebase to codebase. [Currently finding Tiny Tapeout an interesting thing to focus on.]

I mean, I think lilos is awesome, I love how small it is, and I would love to understand it more (your article about the async debugging, and lilosdbg, is amazing btw -- I think I'd still need to play around more with it to understand how stackless async's memory usage scales, and how the stack based linked lists stay around, and compare it to a pre-emptive OS that needs multiple stacks and register stashing). [And yes, I do plan to take a look at Hubris too! I've done some Erlang/Elixir in the past, not yet tried Hubris, but the bits I've briefly read about seem interesting there.]

But, as a different example, I've ended up being the maintainer of a small number of packages for NixOS because I wanted to play around with those packages, so I packaged them up -- perhaps I should just have packaged them up privately, to avoid it being a maintenance burden for the core NixOS people (who click approve and check the package details -- but the automated updates by ofborg are pretty nifty). [And I do still use NixOS and think the core ideas are awesome, even if I have run into lots of bits with it that are not as polished.] So I just wanted to be honest about this, in case that changes things.

I mean, if I do keep using lilos for my keyboards then that makes it more likely that I'll be around to help, but it's probably better if I don't promise and then not be around to help.

Also, congratulations on lilos v1.0! I'll buy you a beer/coffee/tea/etc if you have a donate link anywhere (and want one, of course). Lilos and your blog posts have been very educational :)
@cbiffle I just found the MCX N94/N54 MCUs that are almost completely symmetric Cortex-M33s running at the same clock speed, with a cheap developer board that has SWD access.
I just bought two, but haven't had time to play with them yet...
@adfernandes Neat. At first glance those look very similar to the LPC55S69 we use at my dayjob, which is also a dual M33 where one of the two M33s is missing a bunch of useful features. Fortunately lilos doesn't rely on having an MPU! It would be interesting to see if it has the same bootloader vulnerabilities. But that's another topic. ;-)
Oh, very aware, and I have been following all the public issues... I do secure boot and fuse-blowing at my dayjob for …

And I know you know about the Nordic pure-ARM chips...

... but if you're interested in really heterogeneous multiprocessing, Nordic just put out their docs for the nRF54H20, the one with the combined ARM and RISC-V cores!
Hi,
I hope you had/are having happy holidays if you celebrate them! I've been having lots of fun learning and hacking more on Send/Sync, async and lilos -- thank you for your blog-posts and software :)
This does several things, to get something like https://github.com/rp-rs/rp-hal/blob/main/rp2040-hal/examples/multicore_fifo_blink.rs working.
I'm not an expert on this so it is possible there are better ways to do things -- first it would make sense to read through the description and check my logic makes sense before reviewing the actual code.
- `cargo fmt` commits so they can be easily dropped if you want; my editor just automatically formats on save so it's easiest to do this upfront, but I can now drop those, or you can ignore them by looking at a diff excluding them.
- … `DMB SY` instructions, presumably because of the single-core assumption.
- Adds a `2core` feature to lilos, see below.

`lilos` core changes/design considerations:

- I called it `2core` because some things are now arrays initialized to `[_; 2]`; it should be just a matter of changing the `2` to a different number if there are 4-core microcontrollers etc. I don't imagine there are so many different configurations that it will be a problem. I couldn't find a reasonable way to have Rust/Cargo configure an integer value instead of the boolean on/off of feature flags.
- There is a single shared `WAKE_BITS`, which given we are using multi-core spinlocks is fine. I did consider having a `WAKE_BITS` for each core, but that made it so that a `Notify` had to be only subscribed on that core for the mask to make sense, which seemed less nice.
- We could use `timerawl`/`timerawh` on the RP2040 if we wanted a single source of time, but we would have to lock/unlock the linked list at that point.
- … `lilosdbg` to go along with this (as well as `portable_atomics`, and some other small changes).
- The `Send + 'static` restriction indicating multi-core is coming from `rp2040_hal::multicore::Core::spawn`; I think this is correct, as that is what is creating the new thread, not `lilos::exec::run_tasks`.
Timer related considerations
Sorry I've ended up making a fair bit of changes, here is the summary:
- A `timer` feature (on by default), which turns on/off any timers (previously `systick` did this).
- The `systick` feature specifically turns on an implementation of `timer` that uses `systick` (this is still the default).
- … `systick` is on.
- `time.rs` (`sleep_until`, `sleep_for`, `with_deadline`, `with_timeout`, `PeriodicGate`) is reusable if you implement custom timers.
- Pass the `SysTickTimer` to `run_*`, and to `sleep_*`/etc. I could make this backwards compatible with `#[cfg(...)]` macros, but it ends up looking a bit ugly.

After you have had a brief look and have decided there are no more big changes, there is still some tidying for me to do: rebuild the STM32* examples and make sure they work (for simplicity I've currently only done the RP2040 ones mostly). Mostly just changing them to use the new named `SysTickTimer`.