Benchmark Supported Data Types #2206

Open
wants to merge 35 commits into base: main


pvk-developer
Member

Resolves #2200
CU-86b1xxa7d

This pull request introduces a benchmarking suite designed to test all supported data types for our synthesizers and validation processes. Key changes and additions include:

  • Benchmarking Integration: Added a new benchmarking framework to evaluate the functionality of all supported data types.

  • Private Spreadsheet Integration: The benchmarking results are compared against data read from a private spreadsheet. This spreadsheet contains the expected outcomes for each data type, ensuring that our tests remain accurate and relevant.

  • Automated Test Failures: If a data type is no longer supported due to recent changes, the test will automatically fail. This helps in catching unsupported data types and ensures that our system continues to function correctly with all valid data types.
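The comparison described in the bullets above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `run_check`, `find_regressions`, the dtype names in the dictionaries, and the stand-in lambdas are all hypothetical, and the expected results are hard-coded here instead of being read from the private spreadsheet.

```python
def run_check(check):
    """Run one dtype check; return True if it completes without raising."""
    try:
        check()
    except Exception:
        return False
    return True


def find_regressions(expected, current):
    """Return dtypes that were expected to work but no longer do."""
    return sorted(
        name
        for name, was_supported in expected.items()
        if was_supported and not current.get(name, False)
    )


# Hypothetical expected outcomes (in the PR these come from the spreadsheet).
expected = {'pd.Int64': True, 'np.datetime64': True, 'pa.int8': False}

# Hypothetical current outcomes; the lambdas stand in for real
# synthesizer or validation calls on a sample DataFrame.
current = {
    'pd.Int64': run_check(lambda: int('1')),
    'np.datetime64': run_check(lambda: int('not a number')),
    'pa.int8': False,
}

regressions = find_regressions(expected, current)
```

A test built on this idea would assert that `regressions` is empty, so a dtype that silently stops working causes a failure, while a dtype that was never supported does not.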

@sdv-team
Contributor

sdv-team commented Sep 6, 2024

@pvk-developer pvk-developer marked this pull request as ready for review September 6, 2024 12:19
@pvk-developer pvk-developer requested a review from a team as a code owner September 6, 2024 12:19
@pvk-developer pvk-developer requested review from rwedge and removed request for a team September 6, 2024 12:19
tests/benchmark/supported_dtypes_benchmark.py (outdated)
np.datetime64('2025-01-01T00:00:00'),
])
}),
'np.timedelta64': pd.DataFrame({
Contributor

@gsheni gsheni Sep 6, 2024


Add a few more numpy dtypes:

import numpy as np

np.dtypes.Float16DType()
np.dtypes.Float32DType()
np.dtypes.Float64DType()
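One way these could slot into the existing dictionary-of-DataFrames pattern is sketched below. The dictionary name `NUMPY_FLOAT_DTYPES`, the keys, and the sample values are assumptions for illustration. The sketch builds the dtypes with `np.dtype(...)`, which works on older NumPy as well; on NumPy 1.25 or later, `np.dtype(np.float16)` is an instance of the `np.dtypes.Float16DType` class named above.

```python
import numpy as np
import pandas as pd

# Hypothetical mapping from a label to the NumPy float dtype it exercises.
FLOAT_DTYPES = {
    'np.float16': np.dtype(np.float16),
    'np.float32': np.dtype(np.float32),
    'np.float64': np.dtype(np.float64),
}

# One single-column DataFrame per dtype, mirroring the 'np.timedelta64' entry style.
# The sample values are exactly representable in float16, so no rounding occurs.
NUMPY_FLOAT_DTYPES = {
    name: pd.DataFrame({name: pd.Series([1.5, -2.25, 0.0], dtype=dtype)})
    for name, dtype in FLOAT_DTYPES.items()
}
```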

tests/benchmark/supported_dtypes_benchmark.py (outdated)
tests/benchmark/supported_dtypes_benchmark.py
tests/benchmark/supported_dtypes_benchmark.py
tests/benchmark/utils.py (outdated)
}

PYARROW_DTYPES = {
'pa.int8': pd.DataFrame({'pa.int8': pd.Series([1, -1, 127], dtype=pd.ArrowDtype(pa.int8()))}),
Contributor


Should there be NaNs for all the columns? I believe pyarrow supports that.

      - main

jobs:
  build:
Contributor


minor: can we use more specific names for the jobs? We can actually require certain jobs to pass before allowing merging, so it's helpful if the names are unique

on:
  push:
    branches:
      - main
Contributor


are we still running the tests every time without updating the sheet?

Member Author


Yes, the benchmark is only for sending the message in Slack and updating the gdrive.

tests/benchmark/supported_dtypes_benchmark.py
Development

Successfully merging this pull request may close these issues.

Supported data types benchmark
4 participants