
Support native execution on x86-64 Linux #11

Open
evmar opened this issue May 30, 2024 · 7 comments
Labels
enhancement New feature or request

Comments

evmar (Owner) commented May 30, 2024

Many Linux users will have native x86-64 hardware, and we could use their CPU directly in the same way Rosetta worked, by using the processor's 32-bit compatibility mode.

In other words, the idea here is to take the existing x86-64 Mac support and port it to Linux.

I tinkered a bit in this area here:
https://github.com/evmar/retrowin32/compare/linux?expand=1

Some notes:

  • that branch also has some tinkering with 32-bit Linux, which I think is less valuable because it would mean you need a 32-bit libc etc.

  • the LDT setup on Linux is different; the above branch has some syscall-y bits to at least print the LDT, but I'm not sure I got it right because I get back an empty array; on Mac my recollection is that there were already entries in there (see the first sketch after this list)

  • on Mac I struggled a lot with how the binary gets laid out (see blog post); on Linux it appears the -Ttext linker flag lets you move the text section around, and the heap is after that, which is great. But it also appears the first pages of the binary itself (like the file headers) get mapped at exactly 0x40_0000 on Linux, which is exactly where Windows exes want to go; StackOverflow suggests this address is hardcoded by the linker and we might need to use a linker script to avoid it (see the second sketch after this list)
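
A minimal sketch of the LDT-dumping part (just my reading of the modify_ldt(2) man page, via the libc crate; not what the linux branch actually does). As far as I can tell, Linux only allocates a per-process LDT once something writes an entry, so an empty result from a fresh process may be expected rather than a bug:

```rust
// Sketch: read the raw LDT with modify_ldt and print each 8-byte descriptor.
fn dump_ldt() {
    const READ_LDT: libc::c_int = 0; // modify_ldt "func" 0 = read
    let mut buf = [0u8; 8 * 32]; // room for 32 descriptors
    let n = unsafe {
        libc::syscall(
            libc::SYS_modify_ldt,
            READ_LDT,
            buf.as_mut_ptr(),
            buf.len(),
        )
    };
    println!("modify_ldt(read) -> {} bytes", n);
    for (i, desc) in buf[..n.max(0) as usize].chunks(8).enumerate() {
        println!("ldt[{}] = {:02x?}", i, desc);
    }
}
```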
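
For the 0x40_0000 collision, one thing to try before a full linker script might be GNU ld's -Ttext-segment, which (assuming a non-PIE link with the bfd linker; lld spells this differently) moves the whole first load segment, ELF headers included. A hypothetical build.rs sketch:

```rust
// build.rs sketch (not in the linux branch): move the host binary's headers
// and text far away from 0x400000 so that region stays free for the exe.
fn main() {
    // Keep the address aligned to the linker's max page size
    // (0x200000 by default on x86-64), and link non-PIE.
    println!("cargo:rustc-link-arg-bins=-Wl,-Ttext-segment=0x7f000000");
}
```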

For most of this I think the answer will be roughly "dig through Wine to see how they did it".

evmar (Owner) commented May 30, 2024

We have some functions defined in assembly.

To link them on Mac they are named e.g. _tramp64, but you write the name without the underscore in the Rust extern declaration; this is the Mach-O convention of prefixing C symbols with an underscore.

To link them on Linux there is no underscore prefix, so I'd somehow have to make the code handle both.
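
One possible shape for handling both (a sketch only, with tramp64's real body elided; not how the repo is laid out today) is to emit the assembly through std::arch::global_asm! and let cfg pick the spelling each platform's C symbol convention expects:

```rust
use std::arch::global_asm;

// macOS (Mach-O) looks up C symbols with a leading underscore...
#[cfg(target_os = "macos")]
global_asm!(".globl _tramp64", "_tramp64:", "    ret");

// ...while Linux (ELF) does not.
#[cfg(target_os = "linux")]
global_asm!(".globl tramp64", "tramp64:", "    ret");

// The same extern declaration then links on both platforms.
extern "C" {
    fn tramp64();
}
```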

spiffyguy commented May 31, 2024

> We have some functions defined in assembly.
>
> To link them on Mac they are named e.g. _tramp64, but you write the name without the underscore in the Rust extern declaration; this is the Mach-O convention of prefixing C symbols with an underscore.
>
> To link them on Linux there is no underscore prefix, so I'd somehow have to make the code handle both.

Hi @evmar, big fan of what you are creating here. I am not proficient in Rust, assembly, or the win32 API... but I like to think I can read and somewhat get what's going on.

That being said, I saw some assembly code in a completely different project, petool (MIT licensed), that I thought might apply to the assembly you are writing here.

Notice how in petool's assembly file incbin.S there are both underscore and non-underscore definitions for each function before the assembly is included. I believe this was written deliberately so the code compiles easily on macOS, Linux, or Windows (via MinGW).

https://github.com/FunkyFr3sh/petool/blob/master/src/incbin.S

I hope seeing this helps, if not... sorry for the noise.
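
In Rust terms, I think the same trick would look roughly like this (sketch only, untested by me): give the one routine both spellings, so whichever name the platform's toolchain looks up resolves to the same code.

```rust
use std::arch::global_asm;

// Define the symbol under both names so the file assembles unchanged on
// macOS (underscore prefix) and Linux (no prefix); the body is elided.
global_asm!(
    ".globl _tramp64",
    ".globl tramp64",
    "_tramp64:",
    "tramp64:",
    "    ret",
);
```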

evmar added the enhancement label Jun 2, 2024

mateli commented Jun 6, 2024

On an x64 processor, the compatibility mode for 32-bit applications provides significantly less performance than running 64-bit applications. For example, there is no access to the extra registers or the x64 SIMD extensions. Furthermore, switching between 32-bit compatibility mode and 64-bit mode is expensive.

Trying to make all code faster is premature optimization. For 95% of application code it's not going to make any difference. Finding the 5% where applications spend most of their time and replacing that with optimized native code would provide more performance.

Take, for example, your favorite file-zipping application. If we run it until it is about to run its own implementation of DEFLATE and instead use a zlib-ng library compiled for x64, we make use of all the x64 optimizations, including the extra SIMD instructions. The DEFLATE part of running this zipping application will now probably be faster than when the application runs in 32-bit compatibility mode. The other parts of the application are unlikely to matter much. It may even be faster than the actual 64-bit version of the same application, since we would be using what is currently the best DEFLATE implementation.

What I suggest instead is to create an emulator that can do this without expensive CPU mode switching, and to enhance the emulator with good profiling tools so that we can figure out where it spends time and substitute optimized native code that is faster. Sometimes this is as simple as recompiling the same code; in other cases slow algorithms can be replaced by faster ones. Taking zlib as an example, anything bundling an older, less efficient implementation of zlib would benefit from being redirected to the latest zlib-ng. There is no scenario where DEFLATE running in 32-bit mode is as fast as zlib-ng running in x64 mode, taking full advantage of SIMD and the extra registers.

While an emulator is slower when simply running an application instruction by instruction, it may very well be the fastest way to run applications if the above approach is used. It would even make applications faster if we used it to run 32-bit applications in 32-bit x86 mode: being able to insert better algorithms and other performance optimizations into an application has that effect. Even just swapping old binary code for something compiled with a newer, better compiler can produce great leaps in performance.

There is of course a significant amount of work needed for this approach. For the DEFLATE use case we would have to investigate all applications that use it and make sure we get them to use zlib-ng. But once this is done, zipping will be as fast as it can be on any hardware. And if zlib-ng improves, or another library replaces it as the best DEFLATE implementation, we can easily upgrade or switch to make all those applications faster. That said, for many applications DEFLATE is a very minor part of what they do, and making it faster will not change the overall performance of the application, which is why profiling is important.
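
To make the DEFLATE example concrete, here is a toy sketch of the host side of such a redirection (the function name is made up, and the flate2 crate stands in for a zlib-ng-backed build): if the emulator intercepted a guest call into zlib's compress(), it could satisfy it natively instead of emulating the guest's own 32-bit code.

```rust
use flate2::{write::ZlibEncoder, Compression};
use std::io::Write;

// Hypothetical host-side replacement for a guest's zlib compress() call.
fn host_compress(source: &[u8]) -> std::io::Result<Vec<u8>> {
    let mut encoder = ZlibEncoder::new(Vec::new(), Compression::default());
    encoder.write_all(source)?;
    encoder.finish()
}
```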

hardBSDk (Contributor) commented Jun 9, 2024

The 32-bit compatibility mode was removed in the upcoming x86-S architecture.


mateli commented Jun 9, 2024

Meaning 32 bit will require emulation or binary translation.

evmar (Owner) commented Jul 29, 2024

@cadmic got retrowin32 running on x86-64 Mac, which means we have at least all the CPU-initialization bits in place such that actual x86 hardware believes us. So in theory all that's left for x86-64 Linux is the memory layout and LDT initialization code, probably not too bad?

02fa920
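
For reference, my understanding (untested, straight from the modify_ldt(2) man page) is that the Linux side of the LDT initialization boils down to something like the following: write a 32-bit code-segment descriptor with func 0x11, then transfer control through the selector (entry << 3) | 7.

```rust
// Sketch (not retrowin32 code): install a 32-bit flat code segment in the
// LDT. The flags word packs struct user_desc's bitfields, per asm/ldt.h.
#[repr(C)]
struct UserDesc {
    entry_number: u32,
    base_addr: u32,
    limit: u32,
    flags: u32, // seg_32bit, contents, read_exec_only, limit_in_pages, ...
}

fn install_ldt_code_segment(entry: u32) -> u16 {
    const WRITE_LDT: libc::c_int = 0x11; // modify_ldt "func" 0x11 = write one entry
    let desc = UserDesc {
        entry_number: entry,
        base_addr: 0,
        limit: 0xf_ffff, // 4 GiB when limit_in_pages is set
        // seg_32bit=1, contents=2 (code), limit_in_pages=1, useable=1
        flags: 1 | (2 << 1) | (1 << 4) | (1 << 6),
    };
    let rc = unsafe {
        libc::syscall(
            libc::SYS_modify_ldt,
            WRITE_LDT,
            &desc as *const UserDesc,
            std::mem::size_of::<UserDesc>(),
        )
    };
    assert_eq!(rc, 0, "modify_ldt(write) failed");
    // Selector: descriptor index, table indicator = LDT, RPL = 3.
    ((entry << 3) | 0b111) as u16
}
```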


mateli commented Aug 14, 2024

I am far more interested in being able to run x86-64 applications on ARM, such as modern Macs and the Raspberry Pi.
