Parallelize Parquet Serialization (#7562)
* initial implementation

* cargo fmt

* unbounded channel and flush worker

* disable parallelism by default

* update configs.md

* fix information_schema test
devinjdangelo committed Sep 18, 2023
1 parent f4c4ee1 commit 5718a3f
Showing 4 changed files with 341 additions and 57 deletions.
9 changes: 9 additions & 0 deletions datafusion/common/src/config.rs
@@ -353,6 +353,15 @@ config_namespace! {
/// Sets bloom filter number of distinct values. If NULL, uses
/// default parquet writer setting
pub bloom_filter_ndv: Option<u64>, default = None

/// Controls whether DataFusion will attempt to speed up writing
/// large parquet files by first writing multiple smaller files
/// and then stitching them together into a single large file.
/// This will result in faster write speeds, but higher memory usage.
/// Also currently unsupported are bloom filters and column indexes
/// when single_file_parallelism is enabled.
pub allow_single_file_parallelism: bool, default = false

}
}
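
The doc comment above describes the general idea: serialize several smaller pieces concurrently, then stitch them into one output. As a rough illustration only, the following std-only Rust sketch shows that pattern with threads and a channel. It is not the code from this commit; `serialize_chunk` and `parallel_serialize` are hypothetical names, and the "encoding" is a trivial little-endian dump standing in for real Parquet encoding.

```rust
use std::sync::mpsc;
use std::thread;

/// Hypothetical stand-in for encoding one chunk of rows.
/// Real Parquet encoding is far more involved.
fn serialize_chunk(rows: &[u32]) -> Vec<u8> {
    rows.iter().flat_map(|r| r.to_le_bytes()).collect()
}

/// Serializes each chunk on its own thread, then concatenates
/// the resulting buffers in the original chunk order.
fn parallel_serialize(chunks: Vec<Vec<u32>>) -> Vec<u8> {
    let n = chunks.len();
    let (tx, rx) = mpsc::channel::<(usize, Vec<u8>)>();

    let mut handles = Vec::with_capacity(n);
    for (idx, chunk) in chunks.into_iter().enumerate() {
        let tx = tx.clone();
        handles.push(thread::spawn(move || {
            // Tag the buffer with its index so order can be restored later.
            let bytes = serialize_chunk(&chunk);
            tx.send((idx, bytes)).expect("receiver dropped");
        }));
    }
    // Drop the original sender so the receive loop ends once all workers finish.
    drop(tx);

    // All serialized buffers are held in memory until stitching.
    let mut parts: Vec<Option<Vec<u8>>> = vec![None; n];
    for (idx, bytes) in rx {
        parts[idx] = Some(bytes);
    }
    for handle in handles {
        handle.join().expect("worker panicked");
    }

    // Stitch the buffers together in input order.
    let mut out = Vec::new();
    for part in parts {
        out.extend(part.expect("missing chunk"));
    }
    out
}
```

Calling `parallel_serialize(vec![vec![1, 2], vec![3]])` returns the bytes of 1, 2, 3 in that order, regardless of which worker finishes first. Because every buffer stays in memory until the final concatenation, the sketch also shows where the higher memory usage mentioned in the doc comment can come from.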
