Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Override call_cuml_fit_func to use Dataframe, model saving+loading as numpy #352

Merged
merged 4 commits into from
Aug 8, 2023

Conversation

rishic3
Copy link
Collaborator

@rishic3 rishic3 commented Aug 7, 2023

override call_cuml_fit_func

  • removed barrier stages, returns dataframe rather than rdd
  • allows for:
    • coalesce() rather than repartition
    • toPandas() rather than collect
    • storing raw_data / embeddings as numpy rather than list
    • saving attributes as float32 rather than python float (float64)
  • speedups below

dataset: 50,000 x 3000, float32, parquet

fit runtime transform runtime
rdd + repartition + collect (no overrides) 58.5s 29.2s
df + coalesce + collect (python lists) 46.7s 28.9s
df + coalesce + toPandas (numpy arrays) 24.9s 8.7s

dataset: 100,000 x 3000, float32, parquet

fit runtime transform runtime
rdd + repartition + collect (no overrides) 113.3s 65.4s
df + coalesce + toPandas (numpy arrays) 48.5s 23.7s

override modelwriter / modelreader

  • subclassed cumlmodelreader and cumlmodelwriter to handle numpy saving + loading
    • saves arrays with np.save, creates subdirectory for other model attributes ("metadata")
  • allows for continuous use of numpy arrays between fit and transform phases
  • saves memory and preserves float32 dtype

@rishic3 rishic3 marked this pull request as ready for review August 7, 2023 21:07
Copy link
Collaborator

@leewyang leewyang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks nice! Just some minor comments and some questions for rest of team.

python/src/spark_rapids_ml/umap.py Outdated Show resolved Hide resolved
for row in result:
yield row

output_df = dataset.mapInPandas(_train_udf, schema=self._out_schema())
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If this is mostly duplicated code, wondering if it can be refactored into the existing API.

Also, is the fit_multiple_params API (from @wbo4958) explicitly unsupported then? If so, maybe we should document this, especially if it's removed for specific reasons.

Copy link
Collaborator Author

@rishic3 rishic3 Aug 7, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Took out fit_multiple_params since fit_multiple isn't supported in UMAP - basically just trimmed out everything that wasn't relevant to UMAP specifically since this func only lives in UMAP atm.

As for refactoring this into the existing API, don't think there's a clean way without overriding or creating a new call_fit_func in core due to the RDD return signature and the barrier stuff. If we're interested in using dataframes for future algos that don't require NCCL during fit, we could have a second call_cuml_fit_func within core like the one in this PR for those use cases which future algos (and this algo) could inherit from. Not sure if this is preferred,

python/src/spark_rapids_ml/umap.py Show resolved Hide resolved
python/src/spark_rapids_ml/umap.py Outdated Show resolved Hide resolved
raw_data = self.raw_data
if embedding.dtype != np.float32:
embedding = embedding.astype(np.float32)
raw_data = raw_data.astype(np.float32)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably should log a warning that we're auto-converting the type (but only if it's not too chatty).

Copy link
Collaborator Author

@rishic3 rishic3 Aug 7, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At the moment I'm not supporting user-control over the "convert_dtype" param from cuml (determines whether the internal computations are float64); currently just defaulting to float32. I figured we could keep it like that for now for perf reasons and add float64 support in a future pr (and like Erik mentioned, maybe include a default conversion to float32 much earlier, long before we get to the cuml side, if desired).

python/src/spark_rapids_ml/umap.py Show resolved Hide resolved
@wbo4958
Copy link
Collaborator

wbo4958 commented Aug 8, 2023

@rishic3 Could you run

rdd + no-repartition + collect (no overrides) by adding

--conf spark.sql.files.minPartitionNum=$gpu_workers --conf spark.sql.files.maxPartitionBytes=50000000000

@leewyang
Copy link
Collaborator

leewyang commented Aug 8, 2023

build

@rishic3
Copy link
Collaborator Author

rishic3 commented Aug 8, 2023

@rishic3 Could you run

rdd + no-repartition + collect (no overrides) by adding

--conf spark.sql.files.minPartitionNum=$gpu_workers --conf spark.sql.files.maxPartitionBytes=50000000000

Both tests were run with these settings. I used the benchmark spark config

@rishic3
Copy link
Collaborator Author

rishic3 commented Aug 8, 2023

build

@rishic3 rishic3 merged commit 9ddc749 into NVIDIA:branch-23.08 Aug 8, 2023
1 check passed
@rishic3 rishic3 deleted the umap-overrides branch September 26, 2024 21:35
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants