Integration performance optimization: de-broadcast dipole integration #258

Open
Lazersmoke wants to merge 1 commit into main
Conversation

Lazersmoke (Contributor)

Broadcasting can be (very) slow compared to the for loops this PR introduces, especially when the array size is something like 1x1x1x1 (which requires 4 dimensions of broadcast codegen, bounds checks, etc. that don't need to be there). You can diagnose that this is happening if the @profview profiling window shows lots of calls to broadcast.jl, materialize!: line ### originating from lines with @. inside Integrators.jl, #step!.
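The diff itself isn't reproduced in this thread; as a rough sketch, the change amounts to replacing broadcast statements with explicit loops over the same arrays. The variable names below are illustrative, not the actual Integrators.jl code:

using StaticArrays, LinearAlgebra
const Vec3 = SVector{3, Float64}

# Arrays shaped like the fields of a single-site 1×1×1 system (illustrative only)
Δs = zeros(Vec3, 1, 1, 1, 1)
s  = randn(Vec3, 1, 1, 1, 1)
∇E = randn(Vec3, 1, 1, 1, 1)

# Before: broadcast syntax; materialize! has to handle 4 broadcast dimensions
@. Δs = -s × (s × ∇E)

# After (this PR): a plain loop over shared indices, with no broadcast machinery
for i in eachindex(Δs, s, ∇E)
    Δs[i] = -s[i] × (s[i] × ∇E[i])
end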

For my benchmark (not compatible with Sunny#main) I get a modest speedup (attached screenshot)
[screenshot: benchmark timings showing the speedup]

Lazersmoke (Contributor, Author) commented Apr 30, 2024

Minimal code to reproduce:

using Sunny, LinearAlgebra, Chairmarks

# Smallest possible system: one site in a 1×1×1 cell, dipole mode
cryst = Crystal(I(3), [[0., 0, 0]], 1)
sys = System(cryst, (1, 1, 1), [SpinInfo(1; S=1/2, g=2.5)], :dipole)
set_external_field!(sys, [0, 0.3, 0])
int = ImplicitMidpoint(0.001; damping=0.1, atol=1e-12)
@b (for j in 1:100_000; step!(sys, int); end)

[screenshots: benchmark and profiling output]

kbarros (Member) commented Apr 30, 2024

This is an interesting observation. To some extent I can reproduce the slowdown.

Here is a simpler stand-alone test we could report to Julia developers. Interestingly, it is not always the case that broadcasting is bad. In fact, sometimes it beats for loops.

using StaticArrays, LinearAlgebra, Chairmarks  # LinearAlgebra provides ×
const Vec3 = SVector{3, Float64}

# An extreme case: Broadcasting over vectors of length 1.
Δs = randn(Vec3, 1)
s = randn(Vec3, 1)
∇E = randn(Vec3, 1)

# For loops win for this particular kernel
function f1(Δs, s, ∇E)
    for iters in 1:10_000
        for i in eachindex(Δs)
            Δs[i] = -s[i] × (s[i] × ∇E[i])
        end
    end
end
function f2(Δs, s, ∇E)
    for iters in 1:10_000
        @. Δs = -s × (s × ∇E)
    end
end
@b f1(Δs, s, ∇E) # 28.625 μs
@b f2(Δs, s, ∇E) # 34.041 μs


# Broadcasting (very slightly) wins for a simpler kernel
function g1(Δs, s, ∇E)
    for iters in 1:10_000
        for i in eachindex(Δs)
            Δs[i] = -s[i] × ∇E[i]
        end
    end
end
function g2(Δs, s, ∇E)
    for iters in 1:10_000
        @. Δs = -s × ∇E
    end
end
@b g1(Δs, s, ∇E) # 25.500 μs
@b g2(Δs, s, ∇E) # 23.958 μs

Lazersmoke (Contributor, Author) commented:

This appears to depend mainly on the number of [i] indexing operations (i.e. how many bounds checks). For our case at least, it can be very efficient to use MVector for Δs and SVector for s; I get a 4x speedup on g1 in Kip's benchmark. But it means we have to get the typing information in there somewhere. I'm not sure we want to store this in the type parameters of e.g. System, but if there's a way to convert sys.dipoles into an MVector before passing it to rhs_dipole!, that would be highly advantageous.
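No code for the MVector experiment is attached. The following is a hypothetical sketch of one way to set it up, with an MVector-element destination written in place alongside SVector-element inputs; whether this reproduces the reported 4x speedup on g1 is unverified:

using StaticArrays, LinearAlgebra, Chairmarks
const Vec3  = SVector{3, Float64}
const MVec3 = MVector{3, Float64}

s  = randn(Vec3, 1)
∇E = randn(Vec3, 1)
Δs = MVec3.(randn(Vec3, 1))   # destination array with mutable elements

function g1_mvec(Δs, s, ∇E)
    for iters in 1:10_000
        for i in eachindex(Δs)
            Δs[i] .= -s[i] × ∇E[i]   # write into the existing MVector in place
        end
    end
end
@b g1_mvec(Δs, s, ∇E)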

kbarros (Member) commented Jul 10, 2024

Might also be worth experimenting with this: https://github.com/YingboMa/FastBroadcast.jl.
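For reference, an untested sketch of that experiment. FastBroadcast.jl exports an @.. macro intended as a lower-overhead replacement for @.; the kernel below mirrors f2 from the earlier comment, and whether it helps for these tiny arrays is an open question:

using StaticArrays, LinearAlgebra, Chairmarks, FastBroadcast
const Vec3 = SVector{3, Float64}

Δs = randn(Vec3, 1)
s  = randn(Vec3, 1)
∇E = randn(Vec3, 1)

# Same kernel as f2 above, with FastBroadcast's @.. in place of @.
function f3(Δs, s, ∇E)
    for iters in 1:10_000
        @.. Δs = -s × (s × ∇E)
    end
end
@b f3(Δs, s, ∇E)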
