[sysvabi64] document requirement for bti c in more detail #196

Open
nsz-arm opened this issue Mar 27, 2023 · 6 comments · May be fixed by #282

Comments

@nsz-arm
Contributor

nsz-arm commented Mar 27, 2023

the text currently has

"An executable or shared library that supports BTI must have a bti c instruction at the start of any entry that might be called indirectly."

but it's not clear if compilers should account for potential linker-inserted veneers that use an indirect call/jump, or if the linker should ensure that inserting a veneer does not break bti compatibility.

(gcc+ld.bfd made a different choice than llvm+lld)

see discussion at
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106671

@smithp35
Contributor

Maybe worth covering ELF in addition to sysvabi, as this would also affect bare-metal (pac-bti M), which would presumably hit the same problem if GCC were not emitting BTIs for functions that could require a stub.

My reading was that without a specific exception for linker created veneers/stubs code-generators had to assume that one might be created and generate code as if one could be inserted. I can remember clang always generating BTI instructions as it couldn't make an assumption that an indirect branch would be generated by the linker.

I think it is to the benefit of security to have fewer BTIs, so having linker stubs that are BTI aware is an overall improvement and is likely the preferred direction of travel. I think it is worth a wider discussion because, IMO, to make the GCC behaviour not a bug we would have to add a specific requirement for linkers to be BTI aware in the ABI, and no such requirement exists at the moment.

Assuming we can get agreement to add the requirement, I'm wondering whether anything needs doing about the transition. As I understand it:

  • GCC objects + (BFD prior to 106671 or LLD) are at risk of an indirect jump to a non-BTI compatible function.
  • Clang objects always have BTI so are safe with either linker.

I'm not sure there is anything we can do, as a BTI aware linker will work with both. The only failing case is an older linker with objects that contain non-BTI compatible functions.

The other thing we may want to address is whether there is any additional marking we can do to make your optimisation possible without disassembling the binary.

@MaskRay
Contributor

MaskRay commented Jul 2, 2024

Functions with LR signing get PACI[AB]SP{,PC}. They have an implicit BTI.
If PACI[AB]SP is absent (leaf functions, or when PAuth is not enabled), Clang adds "bti c" to every candidate function to be compatible with LLD and GNU ld before https://sourceware.org/bugzilla/show_bug.cgi?id=30076 in case range extension thunks (aka veneers aka stubs) are needed (https://reviews.llvm.org/D99417).

I assume that the LLD work is planned and Clang will eventually remove the "bti c" (BTW -fbinutils-version= exists if compatibility with older GNU ld versions is needed).
Is there more information about the double veneer scheme used by GNU ld? Do we need a new relocation to mark "bti c"?
(If there is concern with a new relocation type, NONE with a custom addend might be utilized.)

@smithp35
Contributor

smithp35 commented Jul 3, 2024

We've got an idea of where we want to go with this. I've been wanting to have an implementation in LLD ready before publishing, but have not been able to find the time to do this.

The change that needs making should make clear the requirements for code-generators and static linkers. The prevailing opinion within Arm is that we would like to enable code-generators to omit BTI if they can prove that the function will never be called indirectly (GCC behaviour). A static linker may therefore not assume that all indirect branch targets have a BTI compatible landing pad.

A "BTI compatible" thunk either doesn't use an indirect branch (it is a chain of direct branches) or is split into two parts: the indirect branch, and a "header" that contains a BTI c and ends with a direct branch. Something like:

caller:
  bl thunk_to_foo
  ...
thunk_to_foo:
  adrp x16, foo_bti_header
  add  x16, x16, :lo12:foo_bti_header
  br   x16
  ...
foo_bti_header:
  bti  c
  b    foo
  ...
foo:

The "header" has a range limit (±128 MiB), and is essentially an alternative entry point for indirect calls. The presence of this alternative entry point undoes the compiler's hard work in omitting the BTI, but it will only be done if necessary.

As these "BTI compatible" thunks are larger and slower than normal we would want to only generate these when necessary. GNU ld has decided to disassemble the code at the destination. While this is an option, and is the most precise solution, if there are a lot of thunks then this could affect linker performance. If there are only a few then it probably doesn't matter.
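
To make the "disassemble the code at the destination" option concrete, here is a minimal Python sketch of such a check. It is not GNU ld's or LLD's actual code. The helper name `is_bti_c_landing_pad` is hypothetical, the code is assumed to be little-endian A64, and the sketch conservatively accepts only `bti c`, `bti jc`, `paciasp` and `pacibsp` as the first instruction.

```python
# A64 HINT-space encodings (32-bit words, stored little-endian in the section).
BTI_C = 0xD503245F
BTI_JC = 0xD50324DF
PACIASP = 0xD503233F
PACIBSP = 0xD503237F

# Assumption of this sketch: only these first instructions count as a
# landing pad for a thunk that ends in "br x16".
LANDING_PADS = {BTI_C, BTI_JC, PACIASP, PACIBSP}


def is_bti_c_landing_pad(section_bytes: bytes, offset: int = 0) -> bool:
    """Return True if the instruction at `offset` is an accepted landing pad.

    Hypothetical helper: anything that cannot be decoded (truncated or
    misaligned) is reported as "no landing pad", so a caller would fall
    back to emitting a BTI header.
    """
    if offset < 0 or offset % 4 != 0 or offset + 4 > len(section_bytes):
        return False
    insn = int.from_bytes(section_bytes[offset:offset + 4], "little")
    return insn in LANDING_PADS
```

A linker following this approach would only need to run the check for destinations that actually require a long-range thunk, which is why the cost depends on how many thunks there are.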

I am hoping that I can find some heuristics that would let a linker decide based on symbol information so that the need for disassembly is lessened. Assuming GCC's implementation doesn't already break this, it could be possible to say that eliding BTI is only permitted for symbols with STB_LOCAL binding. This would reduce the number of candidates a static linker would need to disassemble to check for a BTI (or just assume it doesn't have one).
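
As an illustration of the STB_LOCAL idea, the following Python sketch filters symbols by ELF binding. The rule itself is only a possibility raised here, not something the ABI states, and the function names and the `(name, st_info)` input shape are hypothetical.

```python
STB_LOCAL = 0  # ELF symbol binding value for local symbols


def may_lack_bti(st_info: int) -> bool:
    """Under the floated rule, only STB_LOCAL symbols may have BTI elided."""
    binding = st_info >> 4  # same computation as the ELF64_ST_BIND macro
    return binding == STB_LOCAL


def candidates_to_check(symbols):
    """Reduce (name, st_info) pairs to the names a linker would inspect."""
    return [name for name, st_info in symbols if may_lack_bti(st_info)]
```

For example, given one local, one global and one weak function symbol, only the local one would remain as a candidate for disassembly (or for a pessimistic "assume no BTI" decision).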

@nsz-arm
Contributor Author

nsz-arm commented Jul 3, 2024

additional details: multiple calls can share the same thunk, and multiple thunks may share the same 'header'. sometimes the header is already within reach of a call (even though the call target is not), and then the header is called directly. in that case it would not even need a bti c unless it is shared with an indirect thunk, but bfd ld does not avoid the bti c here. iirc the veneers are aligned up to an 8-byte boundary so that branches and branch targets are not too close, and thus a chain of single branches could take 8 bytes per veneer instead of just 4. such a design would avoid any bti, so it could be safer, and still be less code if the distances are not too big (<= 3 direct jumps away). this was not tried in bfd ld.

@Wilco1
Contributor

Wilco1 commented Jul 3, 2024

Yes, if veneer insertion was a bit smarter, it could handle all ranges up to ±256 MB using a single direct branch, or ±384 MB using 2 direct branches. For even larger binaries it isn't worth worrying about avoiding the BTI header (since the extra size is negligible), and you could delay the final decision on the target of the indirect branch until late during relocation, when disassembly will be cheaply available.
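
The range arithmetic can be sketched in a few lines of Python. This assumes each direct branch reaches a symmetric ±128 MiB and that veneers can be placed at ideal positions, which a real linker cannot always guarantee. The function name is hypothetical.

```python
B_RANGE = 128 * 1024 * 1024  # assumed reach of one direct branch, in bytes


def direct_veneers_needed(distance: int) -> int:
    """Number of intermediate direct-branch veneers for a given distance.

    The original call counts as the first hop, so a target within
    B_RANGE needs no veneer at all.
    """
    hops = max(1, -(-abs(distance) // B_RANGE))  # ceiling division
    return hops - 1
```

Under these assumptions a target 200 MiB away needs one veneer and a target 300 MiB away needs two, matching the single-branch and two-branch cases described above.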

@smithp35
Contributor

smithp35 commented Jul 3, 2024

LLD can do a limited form of inserting 1 direct branch, but due to restrictions on the placement of the branch it doesn't get the full 128 MiB extra range.

Inserting a chain of branches could be possible but it would add quite a bit of complexity to the existing implementation as there are limited points where the linker can insert the branch, as well as needing to insert thunks across output section boundaries.

The additional, unneeded BTI headers could be used as a landing pad by an attacker, but it would still be fewer landing pads than if the compiler always added BTI. I'll have a think about that when doing the LLD implementation.

smithp35 added a commit to smithp35/abi-aa that referenced this issue Sep 17, 2024
Add requirements for when a tool must generate a BTI instruction.
This permits tools to elide BTI instructions when they can prove that
no indirect branch to that location is possible from local information
available to the tool.

Static linkers are not allowed to assume that all direct branch
targets have a BTI instruction. If a veneer is required then the
static linker must generate additional BTI instructions if needed.

A static linker is allowed to assume that a symbol that is exported
to the dynamic symbol table has a BTI instruction.

In practice this will permit compilers to remove BTI instructions from
static functions, except those whose address is taken and then
escapes.

This matches the behavior of the GNU toolchain.

Fixes ARM-software#196
@smithp35 smithp35 linked a pull request Sep 17, 2024 that will close this issue
smithp35 added a commit to smithp35/llvm-project that referenced this issue Sep 17, 2024
When Branch Target Identification (BTI) is enabled all indirect
branches must target a BTI instruction. A long branch thunk
is a source of indirect branches. To date LLD has been
assuming that the object producer is responsible for putting
a BTI instruction at all places the linker might generate an
indirect branch to. This is true for clang, but not for GCC.
GCC will elide the BTI instruction when it can prove that
there are no indirect branches from outside the translation
unit(s). GNU ld was fixed to generate a landing pad stub
(gnu ld speak for thunk) for the destination when a long
range stub was needed [1].

This means that using GCC compiled objects with LLD may
lead to LLD generating an indirect branch to a location
without a BTI. The ABI [2] has also been clarified to say
that it is a static linker's responsibility to generate
a landing pad when the target does not have a BTI.

This patch implements the same mechanism as GNU ld. When
the output ELF file is setting the
GNU_PROPERTY_AARCH64_FEATURE_1_BTI property, then we check
the destination to see if it has a BTI instruction. If it
does not we generate a landing pad consisting of:
BTI c
B <destination>

The B <destination> can be elided if the thunk can be placed
so that control flow drops through. For example:
BTI c
<destination>:
This will be common when -ffunction-sections is used.

The landing pad thunks are effectively alternative entry
points for the function. Direct branches are unaffected
but any linker generated indirect branch needs to use
the alternative. We place these as close as possible
to the destination section.
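
The landing pad described above can be sketched as a small Python encoder. This is an illustration only, not the code from the patch: the function name `encode_landing_pad` and its address-based interface are hypothetical, and little-endian A64 is assumed.

```python
BTI_C = 0xD503245F      # encoding of "bti c"
B_OPCODE = 0x14000000   # encoding of "b" with a zero imm26 field
B_RANGE = 128 * 1024 * 1024


def encode_landing_pad(pad_addr: int, dest_addr: int) -> bytes:
    """Emit "bti c" followed by "b dest", or "bti c" alone on fall-through."""
    code = BTI_C.to_bytes(4, "little")
    branch_addr = pad_addr + 4
    if dest_addr == branch_addr:
        return code  # destination directly follows the pad: elide the branch
    offset = dest_addr - branch_addr
    if offset % 4 != 0 or not -B_RANGE <= offset < B_RANGE:
        raise ValueError("destination not reachable by a direct branch")
    imm26 = (offset >> 2) & 0x3FFFFFF  # signed word offset, 26 bits
    return code + (B_OPCODE | imm26).to_bytes(4, "little")
```

The range check reflects why placing the pad close to the destination matters: the pad's trailing branch is a direct branch with a limited reach.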

There is some further optimization possible. Consider the
case:
.text
fn1
...
fn2
...

If we need landing pad thunks for both fn1 and fn2 we could
order them so that the thunk for fn1 immediately precedes fn1.
This could save a single branch. However I didn't think that
would be worth the additional complexity.

[1] https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106671
[2] ARM-software/abi-aa#196