VTX-666: Sync from upstream #51

fsdvh · 2024-09-26T10:17:57Z

Which issue does this PR close?

Closes #.

Rationale for this change

What changes are included in this PR?

Are there any user-facing changes?

* Add specific fixed size list concat test * Add fixed size list concat benchmark * Improve `FixedSizeList` concat performance for large list * `cargo fmt` * Increase size of `FixedSizeList` benchmark data * Get capacity recursively for `FixedSizeList` * Reuse `Capacities::List` to avoid breaking change * Use correct default capacities * Avoid a `Box::new()` when not needed * format --------- Co-authored-by: Will Jones <[email protected]>

* add neq/eq benchmark for String/ViewArray * move bench to comparison kernel * clean unnecessary dep * make clippy happy

…s are different (apache#5703)

* Add the ability for Maps to cast to another case where the field names are different.

  Arrow Maps have field names for the elements of the fields. The field names are allowed to be any value and do not affect the type of the data. This allows a Map where the field names are key_value, key, value to be mapped to one where they are entries, keys, values. This can be helpful in merging record batches that may have come from different sources. This also makes maps behave similarly to lists, which also have a field to distinguish their elements.
* Apply suggestions from code review

  Co-authored-by: Andrew Lamb <[email protected]>
* Feedback from code review
  - simplify map casting logic to reuse the entries
  - Added unit tests for negative cases
  - Use MapBuilder to make the intended type clearer.
* fix formatting
* Lint and format
* correctly set the null fields

Co-authored-by: Andrew Lamb <[email protected]>

…fields (apache#5918)

…apache#5913) Updates the requirements on [zstd-sys](https://github.com/gyscos/zstd-rs) to permit the latest version. - [Release notes](https://github.com/gyscos/zstd-rs/releases) - [Commits](gyscos/zstd-rs@zstd-sys-2.0.7...zstd-sys-2.0.11) --- updated-dependencies: - dependency-name: zstd-sys dependency-type: direct:production ... Signed-off-by: dependabot[bot] <[email protected]> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* add impl for box * update * another update * small fix

* implement arrow-row encoding/decoding for view types * add doc comments, better error msg, more test coverage * ensure no performance regression * update perf * fix bug * make fmt happy * Update arrow-array/src/array/byte_view_array.rs Co-authored-by: Raphael Taylor-Davies <[email protected]> * update * update comments * move cmp around * move things around and remove inline hint * Update arrow-array/src/array/byte_view_array.rs Co-authored-by: Andrew Lamb <[email protected]> * Update arrow-ord/src/cmp.rs Co-authored-by: Andrew Lamb <[email protected]> * return error instead of panic * remove unnecessary func --------- Co-authored-by: Andrew Lamb <[email protected]> Co-authored-by: Raphael Taylor-Davies <[email protected]>

…pache#5946) Updates the requirements on [quick-xml](https://github.com/tafia/quick-xml) to permit the latest version. - [Release notes](https://github.com/tafia/quick-xml/releases) - [Changelog](https://github.com/tafia/quick-xml/blob/master/Changelog.md) - [Commits](tafia/quick-xml@v0.32.0...v0.33.0) --- updated-dependencies: - dependency-name: quick-xml dependency-type: direct:production ... Signed-off-by: dependabot[bot] <[email protected]> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* like for string view array * fix bug * update doc * update tests

* test: Add unit test for extending slice of list array * For review

…pache#5954) Updates the requirements on [quick-xml](https://github.com/tafia/quick-xml) to permit the latest version. - [Release notes](https://github.com/tafia/quick-xml/releases) - [Changelog](https://github.com/tafia/quick-xml/blob/master/Changelog.md) - [Commits](tafia/quick-xml@v0.33.0...v0.34.0) --- updated-dependencies: - dependency-name: quick-xml dependency-type: direct:production ... Signed-off-by: dependabot[bot] <[email protected]> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* Improve error message for unsupported nested comparison * Update arrow-ord/src/cmp.rs Co-authored-by: Jay Zhan <[email protected]> --------- Co-authored-by: Jay Zhan <[email protected]>

* skip iterator removed from primitive encoding * special cases for not-null primitives encoding * faster iterators for nullable columns

* Document process for PRs with breaking changes * ticket reference * Update CONTRIBUTING.md Co-authored-by: Xuanwo <[email protected]> --------- Co-authored-by: Xuanwo <[email protected]>

…pache#5928) * Expose IntervalMonthDayNano and IntervalDayTime and update docs * fix doc test

* implement sort for view types * add bench for binary/binary view

* implement sort for view types * add bench for binary/binary view * add view buffer, prepare for byte_view_array reader * make clippy happy * reuse make_view_unchecked * Update parquet/src/arrow/buffer/view_buffer.rs Co-authored-by: Andrew Lamb <[email protected]> * update * rename and inline --------- Co-authored-by: Andrew Lamb <[email protected]>

* failing test * Handle dict ID assignment during flight encoding/decoding * remove println * One more println * Make auto-assign optional * Update docs * Remove breaking change * Update arrow-ipc/src/writer.rs Co-authored-by: Andrew Lamb <[email protected]> * Remove breaking change to DictionaryTracker ctor --------- Co-authored-by: Andrew Lamb <[email protected]>

* Make ObjectStoreScheme public * Fix clippy, add docs and examples --------- Co-authored-by: Andrew Lamb <[email protected]>

) (apache#5980)

* support def_level=1 but non-null column in reader * update comment, adapt ut to the uuid change --------- Co-authored-by: Ye Yuan <[email protected]>

* Update Azure dependencies and add support for Fabric token authentication * Refactor Azure credential provider to support Fabric token authentication * Refactor Azure credential provider to remove unnecessary print statements and improve token handling * Bump object_store version to 0.11.0 * Refactor Azure credential provider to remove unnecessary print statements and improve token handling

* add benchmark * add optimization * fix * fix * cargo fmt * clippy * Update arrow-data/src/decimal.rs Co-authored-by: Liang-Chi Hsieh <[email protected]> * optimize to avoid allocating an idx variable * revert change to public api * fix error in rustdoc --------- Co-authored-by: Liang-Chi Hsieh <[email protected]>

…egexp_is_match_scalar` function, deprecate `regexp_is_match_utf8` and `regexp_is_match_utf8_scalar` (apache#6376) * Implement native support StringViewArray for regex_is_match function * Update test cases to cover StringViewArray lengths of more than 12 bytes * Add StringView benchmark for regexp_is_match Signed-off-by: Tai Le Manh <[email protected]> * Implement native support StringViewArray for regex_is_match function Signed-off-by: Tai Le Manh <[email protected]> * Remove duplicate implementation, fix clippy, add more docs --------- Signed-off-by: Tai Le Manh <[email protected]> Co-authored-by: Andrew Lamb <[email protected]>

Especially when transferring large amounts of data over HTTP/2, this can massively reduce the overhead.

* chore: add docs, part of #37
  - add pragma `#![warn(missing_docs)]` to the following
    - `arrow-array`
    - `arrow-cast`
    - `arrow-csv`
    - `arrow-data`
    - `arrow-json`
    - `arrow-ord`
    - `arrow-pyarrow-integration-testing`
    - `arrow-row`
    - `arrow-schema`
    - `arrow-select`
    - `arrow-string`
    - `arrow`
    - `parquet_derive`
  - add docs to those that generated lint warnings
  - Remove `bitflags` workaround in `arrow-schema`

    At some point, a change in `bitflags v2.3.0` started generating lint warnings in `arrow-schema`. This was handled using a [workaround](apache#4233) ([issue](bitflags/bitflags#356)). `bitflags v2.3.1` fixed the issue, so the workaround is no longer needed.
* fix: resolve comments on PR apache#6433

* fix CI errors * apply suggestion from review Co-authored-by: ngli-me <[email protected]> --------- Co-authored-by: ngli-me <[email protected]>

* Update prost-build requirement from =0.13.2 to =0.13.3 Updates the requirements on [prost-build](https://github.com/tokio-rs/prost) to permit the latest version. - [Release notes](https://github.com/tokio-rs/prost/releases) - [Changelog](https://github.com/tokio-rs/prost/blob/master/CHANGELOG.md) - [Commits](tokio-rs/prost@v0.13.2...v0.13.3) --- updated-dependencies: - dependency-name: prost-build dependency-type: direct:production ... Signed-off-by: dependabot[bot] <[email protected]> * update vendored code --------- Signed-off-by: dependabot[bot] <[email protected]> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Co-authored-by: Andrew Lamb <[email protected]>

* add ParquetMetaDataReader * clippy * Apply suggestions from code review Co-authored-by: Andrew Lamb <[email protected]> * formatting * add ParquetMetaDataReader to module documentation * document errors returned from `try_parse_sized` * oops * rename methods per review suggestion --------- Co-authored-by: Andrew Lamb <[email protected]>

* feat: add union_extract kernel * fix: reexport union_extract in arrow crate * add tests, improve docs, simplify code --------- Co-authored-by: Andrew Lamb <[email protected]>

…especting not preserving dict ID (apache#6444)

* arrow-ipc: Add test for non preserving dict ID behavior with same ID
* arrow-ipc: Always set dict ID in IPC from dictionary tracker

  This decouples dictionary IDs that end up in IPC from the schema further, because the dictionary tracker always first gathers the dict ID for each field, whether it is pre-defined and preserved or not. Then, when actually writing the IPC bytes, the dictionary ID is always taken from the dictionary tracker as opposed to falling back to the `Field` of the `Schema`.
* arrow-ipc: Read dictionary IDs from dictionary tracker in correct order

  When dictionary IDs are not preserved, they are assigned depth first. However, when reading them from the dictionary tracker to write the IPC bytes, they were previously read in the order that the schema is traversed (first come, first served), which caused an incorrect order of dictionaries serialized in IPC.
* Refine IpcSchemaEncoder API and docs
* reduce repeated code
* Fix lints

Co-authored-by: Andrew Lamb <[email protected]>
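The ordering constraint described in this commit message can be illustrated with a small standalone sketch. It does not use the `arrow-ipc` API: `Field`, `TrackerSketch`, and `ids_in_write_order` are hypothetical stand-ins, and the sketch picks pre-order as its depth-first order, which the commit message does not specify. The point is only that the writer has to walk the schema in the same order the tracker used when it handed out IDs.

```rust
/// Hypothetical stand-in for a schema field; not the arrow `Field` type.
struct Field {
    name: &'static str,
    is_dictionary: bool,
    children: Vec<Field>,
}

/// Convenience constructor for the sketch.
fn field(name: &'static str, is_dictionary: bool, children: Vec<Field>) -> Field {
    Field { name, is_dictionary, children }
}

/// Hypothetical tracker that records IDs in the order it assigns them.
#[derive(Default)]
struct TrackerSketch {
    ids: Vec<(&'static str, i64)>,
}

impl TrackerSketch {
    /// Assign IDs depth first (pre-order here, as an assumption).
    fn assign(&mut self, field: &Field) {
        if field.is_dictionary {
            let next_id = self.ids.len() as i64;
            self.ids.push((field.name, next_id));
        }
        for child in &field.children {
            self.assign(child);
        }
    }
}

/// Walk the schema in the same depth-first order and take each ID from the
/// tracker rather than from the field itself.
fn ids_in_write_order(root: &Field, tracker: &TrackerSketch) -> Vec<(&'static str, i64)> {
    fn walk(
        field: &Field,
        ids: &mut std::slice::Iter<'_, (&'static str, i64)>,
        out: &mut Vec<(&'static str, i64)>,
    ) {
        if field.is_dictionary {
            if let Some(&(_, id)) = ids.next() {
                out.push((field.name, id));
            }
        }
        for child in &field.children {
            walk(child, ids, out);
        }
    }

    let mut ids = tracker.ids.iter();
    let mut out = Vec::new();
    walk(root, &mut ids, &mut out);
    out
}
```

If the writer walked the schema in a different order than `assign` did, the names and IDs in the result would no longer line up, which is the kind of mismatch the commit describes.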

…e#6441) * Minor: Add additional documentation and builder APIs to `SortOptions` * Port some uses * Update defaults * Add nulls_first() and nulls_last() and more examples

…pache#6450) * workaround for missing page indexes * remove empty line * Apply suggestions from code review Co-authored-by: Andrew Lamb <[email protected]> * fmt --------- Co-authored-by: Andrew Lamb <[email protected]>

…pache#6452) * Support cast between Durations Signed-off-by: tison <[email protected]> * Support cast between Durations and all numeric type Signed-off-by: tison <[email protected]> * Impl cast between Durations Signed-off-by: tison <[email protected]> * Add test_cast_between_durations Signed-off-by: tison <[email protected]> * add test cases Signed-off-by: tison <[email protected]> * cargo clippy Signed-off-by: tison <[email protected]> --------- Signed-off-by: tison <[email protected]>
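As a rough illustration of what casting between duration units involves, here is a standalone sketch that rescales an `i64` count from one unit to another. It is not the arrow cast kernel: `Unit` and `convert_duration` are hypothetical names, and the choices made here (returning `None` on overflow, truncating toward zero when moving to a coarser unit) are assumptions rather than behavior confirmed by the commit message.

```rust
/// Hypothetical duration units for this sketch.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum Unit {
    Second,
    Millisecond,
    Microsecond,
    Nanosecond,
}

impl Unit {
    /// How many of this unit fit into one second.
    fn per_second(self) -> i64 {
        match self {
            Unit::Second => 1,
            Unit::Millisecond => 1_000,
            Unit::Microsecond => 1_000_000,
            Unit::Nanosecond => 1_000_000_000,
        }
    }
}

/// Rescale `value` from `from` to `to`.
/// Moving to a finer unit multiplies and reports overflow as `None`;
/// moving to a coarser unit divides and truncates toward zero.
fn convert_duration(value: i64, from: Unit, to: Unit) -> Option<i64> {
    let from_scale = from.per_second();
    let to_scale = to.per_second();
    if to_scale >= from_scale {
        value.checked_mul(to_scale / from_scale)
    } else {
        Some(value / (from_scale / to_scale))
    }
}
```

For example, `convert_duration(3, Unit::Second, Unit::Millisecond)` yields `Some(3000)`, while converting `i64::MAX` seconds to nanoseconds yields `None` because the product does not fit in an `i64`.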

judahrand and others added 30 commits June 21, 2024 12:51

Add eq benchmark for StringArray/StringViewArray (apache#5924)

13c9e90

* add neq/eq benchmark for String/ViewArray * move bench to comparison kernel * clean unnecessary dep * make clippy happy

fix(ipc): set correct row count when reading struct arrays with zero …

86eb191

…fields (apache#5918)

Add MultipartUpload blanket implementation for Box<W> (apache#5919)

0ea074a

* add impl for box * update * another update * small fix

Fix typo in benchmarks (apache#5935)

a35214f

row format benches for bool & nullable int (apache#5943)

063ac13

Better document support for nested comparison (apache#5942)

c084342

Implement like/ilike etc for StringViewArray (apache#5931)

66bada5

* like for string view array * fix bug * update doc * update tests

test: Add unit test for extending slice of list array (apache#5948)

460fd55

* test: Add unit test for extending slice of list array * For review

Minor: fixup contribution guide (apache#5952)

901fbe8

chore(5797): change default data_page_row_limit to 20k (apache#5957)

0e56fd5

Improve error message for unsupported nested comparison (apache#5961)

4b326f6

* Improve error message for unsupported nested comparison * Update arrow-ord/src/cmp.rs Co-authored-by: Jay Zhan <[email protected]> --------- Co-authored-by: Jay Zhan <[email protected]>

feat: add max_bytes and min_bytes on PageIndex (apache#5950)

45190ab

Faster primitive arrays encoding into row format (apache#5858)

6b03162

* skip iterator removed from primitive encoding * special cases for not-null primitives encoding * faster iterators for nullable columns

Document process for PRs with breaking changes (apache#5953)

e5604aa

* Document process for PRs with breaking changes * ticket reference * Update CONTRIBUTING.md Co-authored-by: Xuanwo <[email protected]> --------- Co-authored-by: Xuanwo <[email protected]>

like benchmark for StringView (apache#5936)

1ef22e5

Expose IntervalMonthDayNano and IntervalDayTime and update docs (a…

ee55721

…pache#5928) * Expose IntervalMonthDayNano and IntervalDayTime and update docs * fix doc test

implement sort for view types (apache#5963)

6bc9514

Fix FFI array offset handling (apache#5964)

0a4d8a1

Add benchmark for reading binary/binary view from parquet (apache#5968)

c5b5eda

* implement sort for view types * add bench for binary/binary view

Make ObjectStoreScheme public (apache#5912)

a4d2167

* Make ObjectStoreScheme public * Fix clippy, add docs and examples --------- Co-authored-by: Andrew Lamb <[email protected]>

Add operation in ArrowNativeTypeOp::neg_check error message (apache#5944

6230435

) (apache#5980)

feat: support reading OPTIONAL column in parquet_derive (apache#5717)

62c1615

* support def_level=1 but non-null column in reader * update comment, adapt ut to the uuid change --------- Co-authored-by: Ye Yuan <[email protected]>

RobinLin666 and others added 19 commits September 21, 2024 06:15

bump arrow-flight msrv to 1.71.1 (apache#6437)

b809021

feat: expose HTTP/2 max frame size in object_store (apache#6442)

7191f4d

Especially when transferring large amounts of data over HTTP/2, this can massively reduce the overhead.
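To get a feel for the scale involved, the sketch below counts how many HTTP/2 DATA frames a payload needs at a given maximum frame size. The constants come from the HTTP/2 specification (9-byte frame header, default `SETTINGS_MAX_FRAME_SIZE` of 2^14, upper limit of 2^24 - 1). The function names are hypothetical and nothing here touches `object_store`. The commit message does not say where the overhead comes from; the sketch assumes it scales with the number of frames.

```rust
/// Fixed header that precedes every HTTP/2 frame, in bytes.
const FRAME_HEADER_BYTES: u64 = 9;
/// Default SETTINGS_MAX_FRAME_SIZE (2^14).
const DEFAULT_MAX_FRAME_SIZE: u64 = 16_384;
/// Largest value SETTINGS_MAX_FRAME_SIZE may take (2^24 - 1).
const LARGEST_MAX_FRAME_SIZE: u64 = 16_777_215;

/// Number of DATA frames needed to carry `payload_bytes` when each frame
/// holds at most `max_frame_size` bytes of payload.
fn data_frames_needed(payload_bytes: u64, max_frame_size: u64) -> u64 {
    assert!(max_frame_size > 0, "frame size must be positive");
    // Ceiling division without relying on newer integer helpers.
    (payload_bytes + max_frame_size - 1) / max_frame_size
}

/// Total bytes spent on frame headers for the same transfer.
fn frame_header_bytes(payload_bytes: u64, max_frame_size: u64) -> u64 {
    data_frames_needed(payload_bytes, max_frame_size) * FRAME_HEADER_BYTES
}
```

Under these assumptions a 1 GiB body needs 65,536 frames at the default size and 65 frames at the largest size, so any per-frame cost drops by roughly three orders of magnitude.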

Fix doc "bit width" to "byte width" (apache#6434)

477b9f0

Minor: Add some missing documentation to fix CI errors (apache#6445)

4ab97f9

* fix CI errors * apply suggestion from review Co-authored-by: ngli-me <[email protected]> --------- Co-authored-by: ngli-me <[email protected]>

throw arrow error instead of panic (apache#6456)

43dd5e4

Disable rust<>nanoarrow integration test in CI (apache#6449)

4e2b939

Add union_extract kernel (apache#6387)

922a1ff

* feat: add union_extract kernel * fix: reexport union_extract in arrow crate * add tests, improve docs, simplify code --------- Co-authored-by: Andrew Lamb <[email protected]>

Add additional documentation and builder APIs to SortOptions (apach…

6137e91

…e#6441) * Minor: Add additional documentation and builder APIs to `SortOptions` * Port some uses * Update defaults * Add nulls_first() and nulls_last() and more examples
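The builder style mentioned here can be sketched without the arrow crates. `SortOptionsSketch` below is a local stand-in, not the real `SortOptions` type. Only `nulls_first()` and `nulls_last()` are named in the commit message; `desc()` and the default of ascending order with nulls first are assumptions made for the example.

```rust
use std::cmp::Ordering;

/// Local stand-in for a sort-options type with builder-style setters.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
struct SortOptionsSketch {
    descending: bool,
    nulls_first: bool,
}

impl Default for SortOptionsSketch {
    /// Assumed default for this sketch: ascending, nulls first.
    fn default() -> Self {
        Self { descending: false, nulls_first: true }
    }
}

impl SortOptionsSketch {
    /// Sort values in descending order (hypothetical name).
    fn desc(mut self) -> Self {
        self.descending = true;
        self
    }

    /// Place nulls before all values.
    fn nulls_first(mut self) -> Self {
        self.nulls_first = true;
        self
    }

    /// Place nulls after all values.
    fn nulls_last(mut self) -> Self {
        self.nulls_first = false;
        self
    }
}

/// Compare two nullable values according to the options.
fn compare(a: Option<i32>, b: Option<i32>, opts: SortOptionsSketch) -> Ordering {
    match (a, b) {
        (None, None) => Ordering::Equal,
        (None, Some(_)) => {
            if opts.nulls_first { Ordering::Less } else { Ordering::Greater }
        }
        (Some(_), None) => {
            if opts.nulls_first { Ordering::Greater } else { Ordering::Less }
        }
        (Some(x), Some(y)) => {
            let ord = x.cmp(&y);
            if opts.descending { ord.reverse() } else { ord }
        }
    }
}

/// Sort a slice of nullable values in place.
fn sort_with(values: &mut [Option<i32>], opts: SortOptionsSketch) {
    values.sort_by(|a, b| compare(*a, *b, opts));
}
```

Chaining setters, as in `SortOptionsSketch::default().desc().nulls_last()`, reads more clearly at the call site than a struct literal with two booleans, which is presumably the motivation for adding builder methods.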

Update Cargo.toml (apache#6459)

50e9e49

Merge remote-tracking branch 'upstream/master' into sync-from-upstream

458fb77

github-actions bot added arrow arrow-flight object-store parquet parquet-derive labels Sep 26, 2024

remove dup

55751ee

thinkharderdev approved these changes Sep 26, 2024

fsdvh changed the title ~~Sync from upstream~~ VTX-666: Sync from upstream Sep 26, 2024

empty

0b0b3ad

fsdvh merged commit f6df09b into master Sep 26, 2024
35 checks passed

fsdvh deleted the sync-from-upstream branch September 26, 2024 11:24
