core.simd and gl3n #75
Yes, that would make a lot of sense. Back when I wrote gl3n and SIMD came up, I was waiting on std.simd, but that never happened ... |
I made a fork and will try to see if I can implement that somehow. I wouldn't count on me though, I don't have too much time :/ |
So I think there is a problem that will start to exist if SIMD is used to replace the non-SIMD math.
vs
The measured op was
That slowdown becomes even worse if I use automated vectorization. Now I changed the code a bit:
That results in the expected (even though tiny) speedup.
vs
These differences exist because the vectors need to be loaded into the SIMD registers. So operations on the same set of vectors will speed up a lot, while general use will slow down a lot. |
Having a separate struct for Vector/Matrix/Quaternion could make sense, depending on how different it is from the code right now; otherwise just a flag passed to the constructor of the structs, making it possible to have both versions. Code which wants to accept both versions of vectors needs to use |
I tried to integrate it into the normal vector classes via a template argument, and in itself that works fine. However, I am not able to notice any speedup at all (tbh I only implemented the basic operations and tested them), and I suspect the implementation of both the core.simd.Vector types and the __simd magic causes a lot of copying around of data, which is kind of logical because the instructions run on the xmm/ymm registers.
Should, in my understanding, run faster if a, b, c use core.simd.Vector!(float[4]); however, it always ran slower than what I expected. It would be nice to work with data within these registers like you can with the Intel/C++ compiler intrinsics (the _mm_add_ss-like functions that take __m128 and __m256 types). So I'd go so far as to separate the normal vector/matrix types from the SIMD acceleration completely. So then you would do something like
The main difference would be that the SIMD type (which ideally would map directly to a media register) allows no direct access to the memory, to avoid any copying. Also, I'm kind of missing the AVX (256-bit ymm registers, double[4] stuff) support in core.simd.__simd. Hmm. Am I thinking right, or am I blubbering complete bullshit? O.o EDIT: I might have found the reason. This code:
Can be split in two parts. The first one is the assignment:
Well, ok, it also copies the stuff onto the stack? Meh. Now the math in the loop:
OUCH! This should simply be
I guess I should report that as a compiler bug? https://issues.dlang.org/show_bug.cgi?id=16605 |
Thanks for looking into all of this. I can't really help you here, since my knowledge of SSE/SIMD instructions is very limited. You might want to ask in #D on freenode; there are some very smart people with compiler insight who can probably help you in a timely manner. |
No problem, I enjoy this kind of stuff :) I'm gonna head there, because I'm still not sure if my knowledge about SSE/SIMD is enough to come to the right conclusions. Let's see where this is headed! |
It was me who was the fool! "-release" != "-O -release -boundscheck=off"
|
That's almost 3 times faster! |
It gets better!
I'm gonna clean this up a bit and push it to my fork so you can take a look at it, if it fits the guidelines/the way you want stuff to be done for gl3n. |
Here are my changes so far: master...mkalte666:master I know that this is missing tests etc. These I will write as soon as I can, I guess. The speed test tool I used is https://github.com/mkalte666/gl3nspeed You have to compile gl3n/gl3nspeed with "DFLAGS="-release -O -boundscheck=off" dub". Or tell me how I can get dub to use -O xD |
Looks good, minor style things but in general I like how it is done! You gonna look into matrices as well? |
Thanks, I'm trying ^^
If I find the time. I'm not sure how well that can be done and what instructions already exist that could help out. Also, I still want to look into #68, and I guess that could be combined. Thinking about speed and not about time management on my side, this would be a massive improvement however: "4x4 matrix multiplication is 64 multiplications and 48 additions. Using SSE this can be reduced to 16 multiplications and 12 additions (and 16 broadcasts)" http://stackoverflow.com/questions/18499971/efficient-4x4-matrix-multiplication-c-vs-assembly |
One thing I wonder is whether operations with scalars (vec3*float etc.) should be vectorized. While the operation itself would speed up, as long as the numerical value is not constant, the resulting code would almost always be slower, because the scalar would have to be loaded into a vector beforehand. The speedy way of doing a (any operation) multiplication would be to hold a (const?) vector somewhere and then do the operations. So doing
would almost always result in faster code than if one would do
because the operator doesn't know if it operates on a constant value or a variable. If there is a way to separate them (detecting whether a value is constant), then it could be done, though; I don't know how. |
Greetings.
I wondered if it would be useful to use https://dlang.org/spec/simd.html in gl3n. This could come in handy if you use gl3n a lot in collision detection or something similar.
Would that make sense to do?