Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remove JitGenericHandleCache #106843

Merged
merged 12 commits into from
Sep 18, 2024

Conversation

davidwrighton
Copy link
Member

@davidwrighton davidwrighton commented Aug 22, 2024

  • It was only used for overflow scenarios from the generic dictionary (which don't happen), virtual resolution scenarios for creating delegates, and a few other rare R2R scenarios
  • Replace the virtual resolution scenarios with a cache of the affected data in managed code, and move the helpers to managed
    • Use the GenericCache type which was previously only used on NativeAOT for Generic Virtual method resolution paths.
  • Just remove the pointless checks from within the various normal generic lookup paths, and since they no longer do anything interesting in their hot paths except erect a helper method frame and call a worker routine, move those jit helpers to managed as well.

Copy link
Contributor

Tagging subscribers to this area: @mangod9
See info in area-owners.md if you want to be subscribed.

private static unsafe IntPtr VirtualFunctionPointerSlowpath(object obj, IntPtr classHandle, IntPtr methodHandle)
{
IntPtr result = JIT_ResolveVirtualFunctionPointer(ObjectHandleOnStack.Create(ref obj), classHandle, methodHandle);
lock(s_virtualFunctionPointerCache)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Regular locks can call back into user code through SynchronizationManager. It would be better for this to use Lock with useTrivialWaits=true.

@davidwrighton

This comment was marked as resolved.

@EgorBot

This comment was marked as resolved.

@EgorBo

This comment was marked as resolved.

@davidwrighton

This comment was marked as resolved.

@EgorBot

This comment was marked as resolved.

- It was only used for overflow scenarios from the generic dictionary (which don't happen), virtual resolution scenarios for creating delegates, and a few other rare R2R scenarios
- Replace the virtual resolution scenarios with a cache of the affected data in managed code, and move the helpers to managed
- Just remove the pointless checks from within the various normal generic lookup paths

Swap to using GenericCache
- Make FlushCurrentCache public on GenericCache, as we need to clear this when we clean up a LoaderAllocator
- Tweak algorithm for computing out of bound slots on the DictionaryLayout

Update crsttypes
@davidwrighton
Copy link
Member Author

@EgorBot -intel -arm64

using BenchmarkDotNet.Attributes;
using System.Runtime.CompilerServices;

namespace VirtualDispatchChecks
{

    class _1
    {
        public virtual string Method() => "A";
    }

    class _2 : _1
    {
        public override string Method() => "B";
    }

    class _3 : _1
    {
        public override string Method() => "C";
    }

    class _4 : _1
    {
        public override string Method() => "D";
    }

    class _5 : _1
    {
        public override string Method() => "E";
    }
    class _6 : _1
    {
        public override string Method() => "F";
    }
    class _7 : _1
    {
        public override string Method() => "F";
    }

    class _8 : _1
    {
        public override string Method() => "F";
    }

    class _9 : _1
    {
        public override string Method() => "F";
    }

    class _10 : _1
    {
        public override string Method() => "F";
    }

    public class ResolveVirtualRepeatedly
    {
        static _1 _1 = new _1();
        static _1 _2 = new _2();
        static _1 _3 = new _3();
        static _1 _4 = new _4();
        static _1 _5 = new _5();
        static _1 _6 = new _6();
        static _1 _7 = new _7();
        static _1 _8 = new _8();
        static _1 _9 = new _9();
        static _1 _10 = new _10();
        [MethodImpl(MethodImplOptions.NoInlining)]
        static _1 GetA(int index)
        {
            switch (index % 10)
            {
                case 0:
                    return _10;
                case 1: return _1;
                case 2: return _2;
                case 3: return _3;
                case 4: return _4;
                case 5: return _5;
                case 6: return _6;
                case 7: return _7;
                case 8: return _8;
                case 9: return _9;
            }
            return null!;
        }

        [MethodImpl(MethodImplOptions.NoInlining)]
        static Func<string> GetAction(int index)
        {
            return GetA(index).Method;
        }

        [Benchmark]
        public void Get1Virtual()
        {
            GetAction(0);
        }

        [Benchmark]
        public void Get10Virtual()
        {
            GetAction(0);
            GetAction(1);
            GetAction(2);
            GetAction(3);
            GetAction(4);
            GetAction(5);
            GetAction(6);
            GetAction(7);
            GetAction(8);
            GetAction(9);
        }
    }
}

@EgorBot
Copy link

EgorBot commented Aug 27, 2024

Benchmark results on Intel
BenchmarkDotNet v0.14.0, Ubuntu 22.04.4 LTS (Jammy Jellyfish)
Intel Xeon Platinum 8370C CPU 2.80GHz, 1 CPU, 8 logical and 4 physical cores
  Job-UAKKKJ : .NET 10.0.0 (42.42.42.42424), X64 RyuJIT AVX-512F+CD+BW+DQ+VL+VBMI
  Job-SUOWDN : .NET 10.0.0 (42.42.42.42424), X64 RyuJIT AVX-512F+CD+BW+DQ+VL+VBMI
Method Toolchain Mean Error Ratio
Get1Virtual Main 16.24 ns 0.228 ns 1.00
Get1Virtual PR 17.15 ns 0.188 ns 1.06
Get10Virtual Main 164.94 ns 3.211 ns 1.00
Get10Virtual PR 179.95 ns 3.583 ns 1.09

BDN_Artifacts.zip

@EgorBot
Copy link

EgorBot commented Aug 27, 2024

Benchmark results on Arm64
BenchmarkDotNet v0.14.0, Ubuntu 22.04.4 LTS (Jammy Jellyfish)
Unknown processor
  Job-GBFISH : .NET 10.0.0 (42.42.42.42424), Arm64 RyuJIT AdvSIMD
  Job-UCOFWF : .NET 10.0.0 (42.42.42.42424), Arm64 RyuJIT AdvSIMD
Method Toolchain Mean Error Ratio
Get1Virtual Main 20.61 ns 0.022 ns 1.00
Get1Virtual PR 21.18 ns 0.149 ns 1.03
Get10Virtual Main 215.70 ns 0.238 ns 1.00
Get10Virtual PR 308.22 ns 0.914 ns 1.43

BDN_Artifacts.zip

@EgorBo
Copy link
Member

EgorBo commented Aug 27, 2024

PS: I have fixed the patch applying so commit squashing is no longer needed. Also, you can use -perf to get output of perf for a benchmark (only one [Benchmark] is expected) before/after.

@EgorBot -arm64 -perf

using BenchmarkDotNet.Attributes;
using System.Runtime.CompilerServices;

namespace VirtualDispatchChecks
{

    class _1
    {
        public virtual string Method() => "A";
    }

    class _2 : _1
    {
        public override string Method() => "B";
    }

    class _3 : _1
    {
        public override string Method() => "C";
    }

    class _4 : _1
    {
        public override string Method() => "D";
    }

    class _5 : _1
    {
        public override string Method() => "E";
    }
    class _6 : _1
    {
        public override string Method() => "F";
    }
    class _7 : _1
    {
        public override string Method() => "F";
    }

    class _8 : _1
    {
        public override string Method() => "F";
    }

    class _9 : _1
    {
        public override string Method() => "F";
    }

    class _10 : _1
    {
        public override string Method() => "F";
    }

    public class ResolveVirtualRepeatedly
    {
        static _1 _1 = new _1();
        static _1 _2 = new _2();
        static _1 _3 = new _3();
        static _1 _4 = new _4();
        static _1 _5 = new _5();
        static _1 _6 = new _6();
        static _1 _7 = new _7();
        static _1 _8 = new _8();
        static _1 _9 = new _9();
        static _1 _10 = new _10();
        [MethodImpl(MethodImplOptions.NoInlining)]
        static _1 GetA(int index)
        {
            switch (index % 10)
            {
                case 0:
                    return _10;
                case 1: return _1;
                case 2: return _2;
                case 3: return _3;
                case 4: return _4;
                case 5: return _5;
                case 6: return _6;
                case 7: return _7;
                case 8: return _8;
                case 9: return _9;
            }
            return null!;
        }

        [MethodImpl(MethodImplOptions.NoInlining)]
        static Func<string> GetAction(int index)
        {
            return GetA(index).Method;
        }

        [Benchmark]
        public void Get10Virtual()
        {
            GetAction(0);
            GetAction(1);
            GetAction(2);
            GetAction(3);
            GetAction(4);
            GetAction(5);
            GetAction(6);
            GetAction(7);
            GetAction(8);
            GetAction(9);
        }
    }
}

@EgorBot
Copy link

EgorBot commented Aug 27, 2024

Benchmark results on Arm64
BenchmarkDotNet v0.14.0, Ubuntu 22.04.4 LTS (Jammy Jellyfish)
Unknown processor
  Job-YQNLAB : .NET 10.0.0 (42.42.42.42424), Arm64 RyuJIT AdvSIMD
  Job-AMUJHL : .NET 10.0.0 (42.42.42.42424), Arm64 RyuJIT AdvSIMD
Method Toolchain Mean Error Ratio
Get10Virtual Main 229.3 ns 2.22 ns 1.00
Get10Virtual PR 246.1 ns 1.62 ns 1.07

BDN_Artifacts.zip

Flame graphs: Main vs PR 🔥
Hot asm: Main vs PR
Hot functions: Main vs PR

For clean perf results, make sure you have just one [Benchmark] in your app.

@davidwrighton
Copy link
Member Author

An analysis of these results indicates that

  1. While universally slower with the new code, we are close enough for the performance of this relatively rare scenario.
  2. Arm64 showed significantly lower performance once, but that appears to be a case of unlucky hash values in the hash function. Since the has function is not meaningfully changed with this PR, this is considered immaterial.
  3. When running in actual use, this should be slightly faster. Benchmark.NET results are gathered with tiered compilation disabled, and the codegen for the function is slightly better when tiered.

@davidwrighton davidwrighton marked this pull request as ready for review August 27, 2024 22:28
@EgorBo

This comment was marked as resolved.

@dotnet dotnet deleted a comment from EgorBot Aug 27, 2024
@dotnet dotnet deleted a comment from EgorBot Aug 27, 2024
@@ -34,6 +34,8 @@ internal sealed partial class LoaderAllocatorScout
if (m_nativeLoaderAllocator == IntPtr.Zero)
return;

VirtualDispatchHelpers.ClearCache();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There may be a subtle race condition:

  1. the loader allocator is still in use when VirtualDispatchHelpers.ClearCache run
  2. the thread is rescheduled right VirtualDispatchHelpers.ClearCache
  3. the loader allocator becomes unreferenced
  4. Destroy succeeds

I think we will end up with stale entries in the hashtable.

I am wondering whether this is the reason for the casting cache flush to be done while the runtime is suspended: https://github.com/dotnet/runtime/pull/106843/files#diff-060f68aac95f40dd380ee5a82dd047222232b25646f908c1ea394438e18a8709R604

…parameters for function pointer resolution

- This will hopefully catch up performance with the previous implementation, and indicate whether such a change is worthwhile

Move cache clearing to native code to address concern from Jan around LoaderAllocator cleanup
@davidwrighton
Copy link
Member Author

@EgorBot -intel -arm64

using BenchmarkDotNet.Attributes;
using System.Runtime.CompilerServices;

namespace VirtualDispatchChecks
{

    class _1
    {
        public virtual string Method() => "A";
    }

    class _2 : _1
    {
        public override string Method() => "B";
    }

    class _3 : _1
    {
        public override string Method() => "C";
    }

    class _4 : _1
    {
        public override string Method() => "D";
    }

    class _5 : _1
    {
        public override string Method() => "E";
    }
    class _6 : _1
    {
        public override string Method() => "F";
    }
    class _7 : _1
    {
        public override string Method() => "F";
    }

    class _8 : _1
    {
        public override string Method() => "F";
    }

    class _9 : _1
    {
        public override string Method() => "F";
    }

    class _10 : _1
    {
        public override string Method() => "F";
    }

    public class ResolveVirtualRepeatedly
    {
        static _1 _1 = new _1();
        static _1 _2 = new _2();
        static _1 _3 = new _3();
        static _1 _4 = new _4();
        static _1 _5 = new _5();
        static _1 _6 = new _6();
        static _1 _7 = new _7();
        static _1 _8 = new _8();
        static _1 _9 = new _9();
        static _1 _10 = new _10();
        [MethodImpl(MethodImplOptions.NoInlining)]
        static _1 GetA(int index)
        {
            switch (index % 10)
            {
                case 0:
                    return _10;
                case 1: return _1;
                case 2: return _2;
                case 3: return _3;
                case 4: return _4;
                case 5: return _5;
                case 6: return _6;
                case 7: return _7;
                case 8: return _8;
                case 9: return _9;
            }
            return null!;
        }

        [MethodImpl(MethodImplOptions.NoInlining)]
        static Func<string> GetAction(int index)
        {
            return GetA(index).Method;
        }

        [Benchmark]
        public void Get1Virtual()
        {
            GetAction(0);
        }

        [Benchmark]
        public void Get10Virtual()
        {
            GetAction(0);
            GetAction(1);
            GetAction(2);
            GetAction(3);
            GetAction(4);
            GetAction(5);
            GetAction(6);
            GetAction(7);
            GetAction(8);
            GetAction(9);
        }
    }
}

@jkotas
Copy link
Member

jkotas commented Sep 17, 2024

Benchmark results on Intel

I am not able to reproduce this regression on Dev Box Amd. The PR is ~7% faster compared to baseline on that machine.


[MethodImpl(MethodImplOptions.NoInlining)]
[DebuggerHidden]
private static unsafe IntPtr VirtualFunctionPointerSlowpath(object obj, IntPtr classHandle, IntPtr methodHandle)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
private static unsafe IntPtr VirtualFunctionPointerSlowpath(object obj, IntPtr classHandle, IntPtr methodHandle)
private static unsafe IntPtr VirtualFunctionPointerSlow(object obj, IntPtr classHandle, IntPtr methodHandle)

Nit: It is the common naming convention.

(Alternatively, path should be Path.)

@@ -476,6 +477,8 @@ static const Entry s_QCall[] =
DllImportEntry(EHEnumNext)
DllImportEntry(AppendExceptionStackFrame)
#endif // FEATURE_EH_FUNCLETS
DllImportEntry(JIT_ResolveVirtualFunctionPointer)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: I would drop the JIT_ prefix. It is not used with any JIT helper that was converted to managed.

Comment on lines 25 to 28
public int HashCode;
public MethodTable* MethodTable;
public IntPtr ClassHandleTargetMethod;
public IntPtr MethodHandle;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
public int HashCode;
public MethodTable* MethodTable;
public IntPtr ClassHandleTargetMethod;
public IntPtr MethodHandle;
public int _hashCode;
public MethodTable* _methodTable;
public IntPtr _classHandleTargetMethod;
public IntPtr _methodHandle;

Use the BCL naming convention? Also, the BCL convention is to have fields first in the type.

private static unsafe IntPtr VirtualFunctionPointerSlowpath(object obj, IntPtr classHandle, IntPtr methodHandle)
{
IntPtr result = JIT_ResolveVirtualFunctionPointer(ObjectHandleOnStack.Create(ref obj), classHandle, methodHandle);
s_virtualFunctionPointerCache.TrySet(new VirtualResolutionData(RuntimeHelpers.GetMethodTable(obj), classHandle, methodHandle), result);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks a bit suspect that we are passing obj into the resolve worker, but then only use RuntimeHelpers.GetMethodTable(obj) as the cache key. Do we have correct behavior for dynamic castable here?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NativeAOT follows this rule around not caching results (for interface dispatch, but appears to have the same problem for virtual function resolution) CoreCLR does not appear to me to have logic in its dispatch behavior which avoids caching. Possibly we intended to build it, but I'm pretty sure we didn't. I can build a test case to validate my expectation if you'd like. The performance of the existing IDynamicInterfaceCastable implementation would likely be highly impacted if we change the rules here.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The result of this is a useable model, although it's not well defined in behavior. The result is that if any type uses IDynamicInterfaceCastable ... the implementor of IDynamicInterfaceCastable.GetInterfaceImplementaiton really should provide an idempotent implementation for any case where it will successfully return an implementation, or they may have surprising behavior. However, we DO have the correct cache avoidance at cast time. This means that users of types which use IDynamicInterfaceCastable experience the expected behavior.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NativeAOT follows this rule around not caching results (for interface dispatch

We may want to fix NativeAOT to match CoreCLR.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed. I've opened #107999 to track this issue

{
// We can't use GetCurrentStaticAddress, as that may throw, since it will attempt to
// allocate memory for statics if that hasn't happened yet. But, since we force the
// statics memory to be allocated before initializing g_VirtualCache or g_hVirtualCache2
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// statics memory to be allocated before initializing g_VirtualCache or g_hVirtualCache2
// statics memory to be allocated before initializing g_VirtualCache

_ASSERTE(!staticTH.IsCanonicalSubtype());
}
}

pStaticMD->CheckRestore();

// MDIL: If IL specifies callvirt/ldvirtftn it remains a "virtual" instruction
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can this comment & code be deleted?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We're supposed to support this sort of adjustment as part of ReadyToRun (since we support converting a non-virtual method into a virtual method); however, I just checked, and we don't actually have testing for the case where we construct a delegate from the target method, and the behavior at runtime is incorrect in this scenario. (We do have tests for actual dispatch conversion into a virtual method.) I intend to update the comment here, but leave the logic in place, and file a bug about the R2R behavior.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sigh. Nevermind my previous analysis.

  1. Converting from a non-virtual to a virtual in ReadyToRun is not a safe change from the point of view of ReadyToRun. See https://learn.microsoft.com/en-us/dotnet/core/compatibility/library-change-rules for the rules that we actively support, so an affirmative test in C# should not be written.
  2. However, we still need to ensure that code doesn't fail in surprising ways with ReadyToRun, so we still need this logic if a customer converts from Virtual to Non-Virtual, which is ALSO a breaking change, but we can hit it if a customer fails to follow all of the rules in the above document. Keeping the existing behavior is done so that failure modes of ReadyToRun generated code are still broadly similar to the failure modes of non ReadyToRun code.

Comment on lines 1917 to 1918
// Factored out most of the body of JIT_GenericHandle so it could be called easily from the CER reliability code to pre-populate the
// cache.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// Factored out most of the body of JIT_GenericHandle so it could be called easily from the CER reliability code to pre-populate the
// cache.

HashDatum res;
if (g_pJitGenericHandleCache->GetValueSpeculative(&key, &res))
return (CORINFO_GENERIC_HANDLE)(DictionaryEntry)res;
Volatile<FieldDesc*> g_VirtualCache;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
Volatile<FieldDesc*> g_VirtualCache;
Volatile<FieldDesc*> g_pVirtualFunctionPointerCache;

}
public int HashCode;
public MethodTable* MethodTable;
public IntPtr ClassHandleTargetMethod;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
public IntPtr ClassHandleTargetMethod;
public IntPtr ClassHandle;

JIT_ResolveVirtualFunctionPointer and most other places use classHandle. It would be nice to use the same name throughout.

If you would like to add Target to the name to disambiguate it from MethodTable, I would add it to both method handle and class handle args to make it clear that they go together, like _targetMethodHandle and _targetMethodClassHandle.

[MethodImpl(MethodImplOptions.AggressiveInlining)]
public bool Equals(VirtualResolutionData other) =>
HashCode == other.HashCode &&
MethodTable == other.MethodTable &&
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

HashCode should filter out nearly all mismatches, so the rest of the method can assume that the equality check is most likely going to return true. It may help on some architectures to turn the rest of the method to just one conditional branch, something like:

        public bool Equals(VirtualResolutionData other) =>
            HashCode == other.HashCode &&
            (((nint)MethodTable - (nint)other.MethodTable) |
            (ClassHandleTargetMethod - other.ClassHandleTargetMethod) |
            (MethodHandle - other.MethodHandle)) == 0;

Copy link
Member

@jkotas jkotas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

@davidwrighton davidwrighton merged commit 128fdfd into dotnet:main Sep 18, 2024
149 of 151 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants