Add support for insert of Apache Arrow tables (polars DataFrame) #63
Looks like creating the Arrow native payload really is straightforward:

```rust
use std::fs::File;

use arrow2::io::ipc::write;
use polars::prelude::*;

fn write_batches(path: &str, df: DataFrame) -> PolarsResult<()> {
    let file = File::create(path)?;

    // get the Arrow schema from the Polars DataFrame
    let schema = df.schema().to_arrow();

    // write out the data as Arrow RecordBatches
    let options = write::WriteOptions { compression: None };
    let mut writer = write::FileWriter::new(file, schema, None, options);

    writer.start()?;
    for chunk in df.iter_chunks() {
        writer.write(&chunk, None)?;
    }
    writer.finish()?;
    Ok(())
}
```

The binary data in the file can then be sent to ClickHouse, e.g.:

```sh
cat df.arrow | clickhouse-client --ask-password --query="INSERT INTO schema.table FORMAT Arrow"
```
Hello,
Can you share your benchmark code?
I expect it to be less efficient. However, providing a separate API for Arrow is a good idea; I need to think about it.
Hi,

The following Python insert:

```python
con.insert_arrow('my_table', pa.Table.from_pandas(df, preserve_index=False))
```

Note that this includes the conversion from the pandas NumPy backend to pyarrow.

The equivalent Rust logic takes 0.24 s in release mode (if there is a faster way, please let me know):

```rust
let mut insert = con.insert("my_table")?;
for row in rows {
    insert.write(&row).await?;
}
insert.end().await?;
```

with the row type:

```rust
// struct name is illustrative; the original comment only showed the fields
#[derive(Row, Serialize)]
pub struct Candle {
    pub symbol: String,
    #[serde(with = "clickhouse::serde::time::datetime64::millis")]
    pub dt_close: OffsetDateTime,
    pub open: f32,
    pub high: f32,
    pub low: f32,
    pub close: f32,
    pub volume: f32,
}
```

This comes very close to the pandas/arrow version, but surprisingly, it's slower despite having to do no additional conversion. Apart from the potential performance gains, having direct Arrow support would also be convenient.
@rbeeli, have you disabled compression in this library? I mean, I like the Arrow format, but I'm unsure whether I should move this library to TCP+Arrow instead of TCP+Native.
Would really love to see some TCP+Arrow capabilities 👍
Hi,

The Python client of ClickHouse allows inserting a raw `pyarrow.Table` via the `insert_arrow` method, which sends the Apache Arrow encoded data 1:1 to ClickHouse through ClickHouse's `ArrowStream` format. This is incredibly efficient. The code is quite short, see https://github.com/ClickHouse/clickhouse-connect/blob/fa20547d7f7e2fd3a2cf4cd711c3262c5a79be7a/clickhouse_connect/driver/client.py#L576

Surprisingly, the INSERTs using Arrow in Python are even faster than this ClickHouse Rust client using the `RowBinary` format, though I have not investigated where this client loses time.

Has anyone looked into Apache Arrow support and benchmarked it? Rust's `polars` is based on Apache Arrow as its backend, so using the native insert format seems like the logical choice, providing an easy way to directly insert a polars DataFrame into ClickHouse. Supporting Arrow would potentially improve performance, and we could directly query/insert a whole polars DataFrame. These are all Arrow-based standards supported by ClickHouse/polars, so the extension might be straightforward.