
rfcs: add proposal for collecting and analyzing synthetic data #1999

Open
wants to merge 2 commits into base: rfcs
Conversation

rjoursler (Contributor)
Document Link
This RFC proposes new tooling to enable a data pipeline on synthetic data to analyze and validate oneDNN performance.

@rjoursler added the RFC (A design document) label on Jul 18, 2024
Comment on lines +221 to +225
The second modification that will be needed for benchdnn is a new low-fidelity
batched benchmarking mode. This mode would only execute each workload once,
with the expectation that run-to-run variability is being handled by executing
multiple similar workloads. The benchmarking is expected to be performed as
follows on GPU, although some tweaks may be required to get performance stability
Contributor
I think this functionality is currently present in benchdnn via the --fix-times-per-prb knob, we could set it to 1 or some other small value.

Contributor Author

The problem with the --fix-times-per-prb knob is that it is not set up to keep the device saturated with work. This could cause benchmarking issues if the target device enters some kind of power-saving mode while oneDNN primitives are being created. To avoid that, benchdnn currently performs a warmup phase for each workload. This warmup should be unnecessary in a batched benchmarking mode, and since the goal here is to execute each primitive exactly once, the existing warmup phase would be a significant cost to benchmarking throughput.
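To make the proposed mode concrete, here is a minimal sketch (plain Python, not benchdnn code; all names are hypothetical) of low-fidelity batched benchmarking: workloads are dispatched back to back with no per-workload warmup, each executes exactly once, and run-to-run noise is absorbed by aggregating over groups of similar workloads.

```python
import time
from collections import defaultdict

def batched_benchmark(workloads):
    """workloads: iterable of (group_key, callable) pairs.

    Returns a median time per group. No warmup phase: workloads are
    dispatched immediately and back to back, which keeps the device
    busy; variability is handled by grouping similar workloads.
    """
    samples = defaultdict(list)
    for key, fn in workloads:
        start = time.perf_counter()
        fn()  # execute exactly once
        samples[key].append(time.perf_counter() - start)
    # one representative number per group of similar workloads
    return {k: sorted(v)[len(v) // 2] for k, v in samples.items()}

# toy usage: three "similar" workloads sharing one group key
times = batched_benchmark(
    [("gemm_small", lambda: sum(range(10_000))) for _ in range(3)]
)
```

This is only a scheduling sketch; a real implementation would need the stability tweaks mentioned above (e.g. keeping the GPU queue saturated during primitive creation).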

Comment on lines +138 to +146
* Automated Diffing: As one of the relevant metrics is a comparison between
oneDNN builds, this tool needs a way to execute multiple oneDNN libraries. As
a consequence, we can add a testing API for comparing implementations. When
implementations match between oneDNN builds, we can often skip benchmarking.
As oneDNN primitive creation and dispatching can be quite complicated, this is
beneficial for developers as they cannot always predict the effect of an
implementation change. In addition, if functionality validation is supported,
this provides a much faster way to initially validate correctness by skipping
unchanged implementations.
Contributor

You're suggesting that the tool could be used to compare performance (or correctness) differences between two versions of oneDNN? How would you handle changes to the API (breaking or otherwise) between the versions? It seems to me like this feature would add unnecessary complexity.

Contributor Author

How would you handle changes to the API (breaking or otherwise) between the versions?

As things currently stand, this would not be supported. The main intention here is a fast-to-run regression test on a PR, not testing between something like oneDNN releases. The problem, as I see it, is that after an optimization is made, only a (most likely small) percentage of workloads actually change, and we need to validate that optimization. If we can collect performance data for 200 changed cases, we have effective evidence for how the PR performs in practice. This can be done in under a minute, but only if we can quickly identify what has changed. In addition, by using an automated search, we remove the accidental bias caused by developers restricting the search space to cases they believe are affected, which may not align with reality.
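The automated-diffing step described above could be sketched as follows (illustrative Python only; the names and serialization format are hypothetical, not oneDNN API): fingerprint the implementation each build selects per workload, then benchmark only the workloads whose fingerprint changed.

```python
import hashlib

def fingerprint(impl_blob: bytes) -> str:
    # stand-in for hashing serialized kernel/primitive data from one build
    return hashlib.sha256(impl_blob).hexdigest()

def changed_workloads(baseline: dict, pr: dict) -> list:
    """Both dicts map workload name -> serialized implementation blob.

    Returns the workloads whose implementation differs between builds;
    only these need benchmarking (or re-validation).
    """
    return [w for w in baseline
            if w in pr and fingerprint(baseline[w]) != fingerprint(pr[w])]

# toy usage: one of two workloads picks up a new kernel in the PR build
baseline = {"conv_3x3": b"kernel-v1", "gemm_n512": b"kernel-v1"}
pr_build = {"conv_3x3": b"kernel-v1", "gemm_n512": b"kernel-v2"}
to_benchmark = changed_workloads(baseline, pr_build)
```

The same comparison doubles as the fast correctness filter mentioned in the excerpt: unchanged fingerprints can skip re-validation entirely.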

* Option 2: Hash relevant primitive data to generate a UUID

Given that we are looking at generating millions of primitives, and serialized
primitive data is likely to be in the 10's of kilobytes in size, Option 1
@densamoilov (Contributor), Jul 19, 2024

There is an API to query a cache blob ID from primitive descriptors to distinguish different primitives. Would it work here and have you checked how big it is?

@rjoursler (Contributor Author), Jul 19, 2024

I had not considered that API. The biggest issue I see is that primitive descriptors do not contain enough information. For example, consider the recent GEMM fix in commit 25f6896. It only affects the GPU kernel binary generated by the jit:gemm implementation, so a kernel changed by this commit cannot be caught within the scope of the primitive descriptor.

On the other hand, the previous change in commit e8e05b4 would affect the primitive descriptor, so in theory it could be caught by a cache blob ID (with appropriate modifications for this task). I had considered adding a mode that filters on primitive descriptor changes, which boils down to a performance/accuracy tradeoff that developers could choose based on the work performed. I decided not to include it until we know that generating primitives is too slow in practice. While this is definitely a potential issue, since we can only generate around ~100 OpenCL kernels per second for the GPU, it shouldn't be a problem with the move to reusable implementations unless searching many post-op combinations becomes important.
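The hash-based Option 2 from the RFC excerpt could be sketched like this (hedged, stand-alone Python; the serialized blobs here are placeholders for real primitive data): hash the serialized primitive data and truncate the digest into a 128-bit UUID, so millions of tens-of-kilobyte blobs can be compared by fixed-size IDs instead of full contents.

```python
import hashlib
import uuid

def primitive_uuid(serialized: bytes) -> uuid.UUID:
    """Derive a stable 128-bit ID from serialized primitive data.

    SHA-256 of the blob, truncated to 16 bytes and stamped as a UUID;
    identical inputs always map to the same ID.
    """
    digest = hashlib.sha256(serialized).digest()
    return uuid.UUID(bytes=digest[:16])

# toy usage: identical primitive data collides by design,
# any difference (here, f32 vs f16) yields a distinct ID
a = primitive_uuid(b"conv2d fwd f32 NCHW 1x64x56x56")
b = primitive_uuid(b"conv2d fwd f32 NCHW 1x64x56x56")
c = primitive_uuid(b"conv2d fwd f16 NCHW 1x64x56x56")
```

With a 128-bit digest, accidental collisions across even millions of primitives are negligible, which is what makes the ID usable as a database key in place of the full serialized data.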
