Override call_cuml_fit_func to use Dataframe, model saving+loading as numpy #352

Merged

rishic3 merged 4 commits into NVIDIA:branch-23.08 from rishic3:umap-overrides

Aug 8, 2023

Collaborator

rishic3 commented Aug 7, 2023

override call_cuml_fit_func

removed barrier stages, returns dataframe rather than rdd
allows for:
- coalesce() rather than repartition
- toPandas() rather than collect
- storing raw_data / embeddings as numpy rather than list
- saving attributes as float32 rather than python float (float64)
speedups below

dataset: 50,000 x 3000, float32, parquet

| configuration | fit runtime | transform runtime |
| --- | --- | --- |
| rdd + repartition + collect (no overrides) | 58.5s | 29.2s |
| df + coalesce + collect (python lists) | 46.7s | 28.9s |
| df + coalesce + toPandas (numpy arrays) | 24.9s | 8.7s |

dataset: 100,000 x 3000, float32, parquet

| configuration | fit runtime | transform runtime |
| --- | --- | --- |
| rdd + repartition + collect (no overrides) | 113.3s | 65.4s |
| df + coalesce + toPandas (numpy arrays) | 48.5s | 23.7s |
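To illustrate the memory side of the list-versus-numpy point above, here is a minimal sketch that compares a nested Python list of floats with a float32 numpy array of the same shape. The shape is a small stand-in, not the benchmark dataset, and `nested_list_bytes` is a hypothetical helper that is not part of this PR.

```python
import sys

import numpy as np

# Small stand-in shape (assumption); the benchmarks above used 50,000 x 3000.
rows, cols = 1_000, 30

rng = np.random.default_rng(0)
arr32 = rng.random((rows, cols), dtype=np.float32)

# Roughly what collect() plus list conversion leaves on the driver:
# a list of rows, each row a list of Python floats (float64 objects).
as_lists = arr32.astype(np.float64).tolist()


def nested_list_bytes(data):
    """Approximate size of a list of lists of floats, counting every object."""
    total = sys.getsizeof(data)
    for row in data:
        total += sys.getsizeof(row)
        total += sum(sys.getsizeof(value) for value in row)
    return total


list_bytes = nested_list_bytes(as_lists)
array_bytes = arr32.nbytes  # rows * cols * 4 bytes for float32

print(f"nested lists: {list_bytes} bytes, float32 array: {array_bytes} bytes")
```

The exact list size depends on the Python build, but the array buffer is always `rows * cols * 4` bytes for float32, and half the size of the same data stored as float64.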

override modelwriter / modelreader

subclassed cumlmodelreader and cumlmodelwriter to handle numpy saving + loading
- saves arrays with np.save, creates subdirectory for other model attributes ("metadata")
allows for continuous use of numpy arrays between fit and transform phases
saves memory and preserves float32 dtype
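A rough sketch of the save/load layout described above: arrays written with `np.save`, everything else in a "metadata" subdirectory. The file names, the JSON format for the attributes, and the `save_model` / `load_model` helpers are assumptions for illustration, not the actual writer and reader classes in this PR.

```python
import json
import os
import tempfile

import numpy as np


def save_model(path, arrays, attrs):
    """Write each array as <name>.npy and the remaining attributes as JSON."""
    meta_dir = os.path.join(path, "metadata")  # subdirectory for non-array attributes
    os.makedirs(meta_dir, exist_ok=True)
    for name, arr in arrays.items():
        np.save(os.path.join(path, f"{name}.npy"), arr)
    with open(os.path.join(meta_dir, "attrs.json"), "w") as f:
        json.dump(attrs, f)


def load_model(path, names):
    """Load the named arrays and the JSON attributes back."""
    arrays = {name: np.load(os.path.join(path, f"{name}.npy")) for name in names}
    with open(os.path.join(path, "metadata", "attrs.json")) as f:
        attrs = json.load(f)
    return arrays, attrs


# Tiny illustrative arrays; the attribute value is made up for the example.
embedding = np.arange(6, dtype=np.float32).reshape(3, 2)
raw_data = np.ones((3, 4), dtype=np.float32)

with tempfile.TemporaryDirectory() as tmp:
    save_model(tmp, {"embedding": embedding, "raw_data": raw_data}, {"n_neighbors": 15})
    loaded, attrs = load_model(tmp, ["embedding", "raw_data"])

print(loaded["embedding"].dtype, loaded["raw_data"].shape, attrs)
```

Because `.npy` files record the dtype, the arrays come back as float32 without a round trip through Python floats.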

rishic3 added 3 commits

August 7, 2023 12:53


          overrides, numpy saving+loading

25591ba


          dtype updates

e4480a7


          fixed dtyping

cac219d

Signed-off-by: Rishi <[email protected]>

rishic3 marked this pull request as ready for review

August 7, 2023 21:07

leewyang reviewed

Collaborator

leewyang left a comment

Looks nice! Just some minor comments and some questions for rest of team.

python/src/spark_rapids_ml/umap.py

+                          for row in result:
+                              yield row
+                      output_df = dataset.mapInPandas(_train_udf, schema=self._out_schema())

Collaborator

leewyang Aug 7, 2023

If this is mostly duplicated code, wondering if it can be refactored into the existing API.

Also, is the fit_multiple_params API (from @wbo4958) explicitly unsupported then? If so, maybe we should document this, especially if it's removed for specific reasons.

Collaborator Author

rishic3 Aug 7, 2023

Took out fit_multiple_params since fit_multiple isn't supported in UMAP - basically just trimmed out everything that wasn't relevant to UMAP specifically since this func only lives in UMAP atm.

As for refactoring this into the existing API, I don't think there's a clean way without overriding or creating a new call_fit_func in core, due to the RDD return signature and the barrier stuff. If we're interested in using dataframes for future algos that don't require NCCL during fit, we could have a second call_cuml_fit_func within core, like the one in this PR, which future algos (and this algo) could inherit from. Not sure if this is preferred.
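For readers unfamiliar with the pattern under discussion, here is a minimal sketch of a DataFrame-returning training function in the `mapInPandas` style. The column names, the output schema, and the body of `train_udf` are placeholders, not the code in this PR, and the function is exercised locally with plain pandas so it runs without a Spark session.

```python
from typing import Iterator

import numpy as np
import pandas as pd


def train_udf(batches: Iterator[pd.DataFrame]) -> Iterator[pd.DataFrame]:
    """Consume all batches of one partition and yield a result DataFrame."""
    # Gather every batch into a single float32 array (assumes a "features"
    # column holding equal-length lists, which is a hypothetical layout).
    parts = [np.array(pdf["features"].tolist(), dtype=np.float32) for pdf in batches]
    features = np.concatenate(parts)
    # Placeholder for the real fit call; here we only report the shape seen.
    yield pd.DataFrame({"n_rows": [features.shape[0]], "n_cols": [features.shape[1]]})


# Local check without Spark: feed the function an iterator of pandas batches.
batches = iter(
    [
        pd.DataFrame({"features": [[1.0, 2.0], [3.0, 4.0]]}),
        pd.DataFrame({"features": [[5.0, 6.0]]}),
    ]
)
out = pd.concat(train_udf(batches), ignore_index=True)
print(out)

# With a Spark DataFrame `dataset`, the same function would be applied as:
#   output_df = dataset.mapInPandas(train_udf, schema="n_rows long, n_cols long")
```

Since no barrier stage is involved, the result stays a DataFrame and can be brought to the driver with `toPandas()` rather than collected as RDD rows.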

python/src/spark_rapids_ml/umap.py

+                          raw_data = self.raw_data
+                          if embedding.dtype != np.float32:
+                              embedding = embedding.astype(np.float32)
+                              raw_data = raw_data.astype(np.float32)

Collaborator

leewyang Aug 7, 2023

Probably should log a warning that we're auto-converting the type (but only if it's not too chatty).

Collaborator Author

rishic3 Aug 7, 2023

At the moment I'm not supporting user-control over the "convert_dtype" param from cuml (determines whether the internal computations are float64); currently just defaulting to float32. I figured we could keep it like that for now for perf reasons and add float64 support in a future pr (and like Erik mentioned, maybe include a default conversion to float32 much earlier, long before we get to the cuml side, if desired).
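One way the warning suggested in this thread could look, as a sketch only: `ensure_float32` and the logger name are hypothetical and not part of this PR. Unlike the diff above, which checks only `embedding.dtype` and converts both arrays, this version checks each array independently.

```python
import logging

import numpy as np

logger = logging.getLogger("umap_dtype_sketch")  # hypothetical logger name


def ensure_float32(name, arr):
    """Return arr as float32, warning once per call if a conversion happens."""
    if arr.dtype != np.float32:
        logger.warning("Converting %s from %s to float32", name, arr.dtype)
        return arr.astype(np.float32)
    return arr  # already float32: no copy, no warning


embedding64 = np.zeros((2, 3), dtype=np.float64)
raw_data32 = np.ones((2, 5), dtype=np.float32)

embedding = ensure_float32("embedding", embedding64)  # logs a warning
raw_data = ensure_float32("raw_data", raw_data32)  # passes through unchanged
```

Calling the helper once per array at transform time keeps the output to at most two messages per call, which should address the concern about chattiness.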

Collaborator

wbo4958 commented Aug 8, 2023

@rishic3 Could you run `rdd + no-repartition + collect (no overrides)` by adding `--conf spark.sql.files.minPartitionNum=$gpu_workers --conf spark.sql.files.maxPartitionBytes=50000000000`?

Collaborator

leewyang commented Aug 8, 2023

build

Collaborator Author

rishic3 commented Aug 8, 2023

> @rishic3 Could you run `rdd + no-repartition + collect (no overrides)` by adding `--conf spark.sql.files.minPartitionNum=$gpu_workers --conf spark.sql.files.maxPartitionBytes=50000000000`

Both tests were run with these settings. I used the benchmark Spark config.


          accessor fixes

9308cb4

Collaborator Author

rishic3 commented Aug 8, 2023

build

leewyang approved these changes

rishic3 merged commit 9ddc749 into NVIDIA:branch-23.08

1 check passed

rishic3 deleted the umap-overrides branch

September 26, 2024 21:35
