Releases · apache/hudi

29 May 03:05

vinothchandar

hoodie-0.4.7

227785c

Release 0.4.7

Highlights

Major release with fundamental changes to filesystem listing & write failure handling
Introduced the first version of HoodieTimelineServer that runs embedded on the driver
All executors now fetch filesystem listings via RPC from the timeline server, drastically reducing filesystem listing calls!
Failing concurrent write tasks are now handled differently to be robust against spark stage retries
Bug fixes/clean up around indexing, compaction

Full PR List

@bvaradar - HUDI-135 - Skip Meta folder when looking for partitions #698
@bvaradar - HUDI-136 - Only inflight commit timeline (.commit/.deltacommit) must be used when checking for sanity during compaction scheduling #699
@bvaradar - HUDI-134 - Disable inline compaction for Hoodie Demo #696
@v3nkatesh - default implementation for HBase index qps allocator #685
@bvaradar - SparkUtil#initLauncher shouldn't raise when spark-defaults.conf doesn't exist #670
HUDI-131 Zero File Listing in Compactor run #693
@vinothchandar - Fixed HUDI-116 : Handle duplicate record keys across partitions #687
@leilinen - HUDI-105 : Fix up offsets not available on leader exception #650
@bvaradar - Allow users to set hoodie configs for Compactor, Cleaner and HDFSParquetImporter utility scripts #691
@bvaradar - Spark Stage retry handling #651
@pseudomuto - HUDI-113: Use Pair over # delimited string #672
@bvaradar - Support nested types for recordKey, partitionPath and combineKey #684
@vinothchandar - Downgrading fasterxml jackson to 2.6.7 to be spark compatible #686
@bvaradar - Timeline Service with Incremental View Syncing support #600

29 May 03:03

vinothchandar

hoodie-0.4.6

cc38abe

Release 0.4.6

Highlights

Index performance! Interval trees + bucketized checking speed up index lookup by up to 10x!
Faster writing due to cached Avro encoders/decoders, lighter memory usage, and less data shuffled.
Support for Spark jobs using more than one core per executor
DeltaStreamer bug fixes (inline compaction, hive sync, error record handling)
Empty record payload to support deletes out of the box
Fixes to hive/spark bundles around dependencies, versioning, shading
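
As a rough illustration of the empty-payload idea in the highlights above, the sketch below models an upsert in which a payload that merges to "nothing" removes the record. It is not Hudi's API: the class names `ValuePayload` and `EmptyPayload`, the `combine` method, and the dict-backed table are all hypothetical stand-ins chosen for this example.

```python
from typing import Dict, Optional


class ValuePayload:
    """Hypothetical payload that overwrites the stored record with a new value."""

    def __init__(self, value: dict) -> None:
        self.value = value

    def combine(self, current: Optional[dict]) -> Optional[dict]:
        # Ignore whatever is stored and return the incoming value.
        return self.value


class EmptyPayload:
    """Hypothetical payload that always merges to nothing, i.e. a delete."""

    def combine(self, current: Optional[dict]) -> Optional[dict]:
        return None


def upsert(table: Dict[str, dict], batch: Dict[str, object]) -> None:
    """Apply a batch of payloads keyed by record key to an in-memory table."""
    for key, payload in batch.items():
        merged = payload.combine(table.get(key))
        if merged is None:
            # An empty merge result drops the record if it exists.
            table.pop(key, None)
        else:
            table[key] = merged


# Usage: delete k1 and insert k3 in a single upsert batch.
table = {"k1": {"v": 1}, "k2": {"v": 2}}
upsert(table, {"k1": EmptyPayload(), "k3": ValuePayload({"v": 3})})
```

The point of the design is that deletes travel through the same write path as updates, so no separate delete operation is needed in this toy model.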

Full PR List

@bvaradar - Minor CLI documentation change in delta-streamer #679
@n3nash - converting map task memory from mb to bytes #678
@bvaradar - Fix various errors found by long running delta-streamer tests #675
@vinothchandar - Bucketized Bloom Filter checking #671
@pseudomuto - SparkUtil#initLauncher shouldn't raise when spark-defaults.conf doesn't exist #670
@abhioncbr - HUDI-101: added exclusion filters for signature files. #669
@ovj - migrating kryo's dependency from twitter chill to plain kryo library #649
@bvaradar - Revert "HUDI-101: added maven-shade plugin with filters." #665
@abhioncbr - HUDI-101: added maven-shade plugin with filters. #659
@bvaradar - Rollback inflights when using Spark [Streaming] write #660
@vinothchandar - Making DataSource/DeltaStreamer use defaults for combining #634
@vinothchandar - Fixes HUDI-85 : Interval tree based pruning for Bloom Index #653
@takezoe - Fix to enable hoodie.datasource.read.incr.filters #655
@n3nash - Removing OLD MAGIC header #648
@bvaradar - Revert "Read and apply schema for each log block from the metadata header instead of the latest schema" #647
@lyogev - Add empty payload class to support deletes via apache spark #635
@bvaradar - Move to apachehudi dockerhub repository & use openjdk docker containers #644
@bvaradar - Fix Hive RT query failure in hoodie demo #645
@ovj - Revert - Replacing Apache commons-lang3 object serializer with Kryo #642
@n3nash - Read and apply schema for each log block from the metadata header instead of the latest schema #640
@bhasudha - FIXES HUDI-98: Fix multiple issues when using build_local_docker_images for demo setup #636
@n3nash - Performing commit archiving in batches to avoid keeping a huge chunk in memory #631
@bvaradar - Essential Hive packages missing in hoodie spark bundle #633
@n3nash - 1. Minor changes to fix compaction 2. Adding 2 compaction policies 3. Adding a Hbase index property #629
@milantracy - [HUDI-66] FSUtils.getRelativePartitionPath does not handle repeated f… #627
@vinothchandar - Fixing small file handling, inline compaction defaults #599
@vinothchandar - Follow up HUDI-27 : Call super.close() in HoodieWraperFileSystem::close() #621
@vinothchandar - Fix HUDI-27 : Support num_cores > 1 for writing through spark #620
@vinothchandar - Fixes HUDI-38: Reduce memory overhead of WriteStatus #616
@vinothchandar - Fixed HUDI-87 : Remove schemastr from BaseAvroPayload #619
@vinothchandar - Fixes HUDI-9 : Check precondition minInstantsToKeep > cleanerCommitsR… #617
@n3nash - Fixing source schema and writer schema distinction in payloads #612
@ambition119 - [HUDI-63] Removed unused BucketedIndex code #608
@bvaradar - run_hive_sync tool must be able to handle case where there are multiple standalone jdbc jars in hive installation dir #609
@milantracy - add a script that shuts down demo cluster gracefully #606
@n3nash - Enable multi rollbacks for MOR table type #546
@ovj - Replacing Apache commons-lang3 object serializer with Kryo serializer #583
@kaka11chen - Add compression codec configurations for HoodieParquetWriter. #604
@smarthi - HUDI-75: Add KEYS #601
@vinothchandar - Removing docs folder from master branch #602
@bvaradar - Fix hive sync and deltastreamer issue in demo #593
@bhasudha - Fix quickstart documentation for querying via Presto #598
@ovj - Handling duplicate record update for single partition (duplicates in single or different parquet files) #584
@kaka11chen - Fix avro doesn't have short and byte type. #595
@bvaradar - FIleSystem View to handle same fileIds across partitions correctly #572
@vinothchandar - Upgrade various jar, gem versions for maintenance #575

29 May 03:00

vinothchandar

hoodie-0.4.5

bbf40ef

Release 0.4.5

Highlights

Dockerized demo with support for different Hive versions
Smoother handling of append log on cloud stores
Introducing a global bloom index that enforces a uniqueness constraint across partitions
CLI commands to analyze workloads, manage compactions
Migration guide for folks wanting to move datasets to Hudi
Added Spark Structured Streaming support, with a Hudi sink
In-built support for filtering duplicates in DeltaStreamer
Support for plugging in custom transformation in DeltaStreamer
Better support for non-partitioned Hive tables
Support hard deletes for Merge on Read storage
New slack url & site urls
Added presto bundle for easier integration
Tons of bug fixes, reliability improvements

Full PR List

@bhasudha - Create hoodie-presto bundle jar. fixes #567 #571
@bhasudha - Close FSDataInputStream for meta file open in HoodiePartitionMetadata. Fixes issue #573 #574
@yaooqinn - handle no such element exception in HoodieSparkSqlWriter #576
@vinothchandar - Update site url in README
@yaooqinn - typo: bundle jar with unrecognized variables #570
@bvaradar - Table rollback for inflight compactions MUST not delete instant files at any time to avoid race conditions #565
@bvaradar - Fix Hoodie Record Reader to work with non-partitioned dataset (ISSUE-561) #569
@bvaradar - Hoodie Delta Streamer Features : Transformation and Hoodie Incremental Source with Hive integration #485
@vinothchandar - Updating new slack signup link #566
@yaooqinn - Using immutable map instead of mutables to generate parameters #559
@n3nash - Fixing behavior of buffering in Create/Merge handles for invalid/wrong schema records #558
@n3nash - cleaner should now use commit timeline and not include deltacommits #539
@n3nash - Adding compaction to HoodieClient example #551
@n3nash - Filtering partition paths before performing a list status on all partitions #541
@n3nash - Passing a path filter to avoid including folders under .hoodie directory as partition paths #548
@n3nash - Enabling hard deletes for MergeOnRead table type #538
@msridhar - Add .m2 directory to Travis cache #534
@artem0 - General enhancements #520
@bvaradar - Ensure Hoodie works for non-partitioned Hive table #515
@xubo245 - fix some spelling errors in Hudi #530
@leletan - feat(SparkDataSource): add structured streaming sink #486
@n3nash - Serializing the complete payload object instead of serializing just the GenericRecord in HoodieRecordConverter #495
@n3nash - Returning empty Statuses for an empty spark partition caused due to incorrect bin packing #510
@bvaradar - Avoid WriteStatus collect() call when committing batch to prevent Driver side OOM errors #512
@vinothchandar - Explicitly handle lack of append() support during LogWriting #511
@n3nash - Fixing number of insert buckets to be generated by rounding off to the closest greater integer #500
@vinothchandar - Enabling auto tuning of insert splits by default #496
@bvaradar - Useful Hudi CLI commands to debug/analyze production workloads #477
@bvaradar - Compaction validate, unschedule and repair #481
@shangxinli - Fix addMetadataFields() to carry over 'props' #484
@n3nash - Adding documentation for migration guide and COW vs MOR tradeoffs #470
@leletan - Add additional feature to drop later arriving dups #468
@bvaradar - Fix regression bug which broke HoodieInputFormat handling of non-hoodie datasets #482
@vinothchandar - Add --filter-dupes to DeltaStreamer #478
@bvaradar - A quickstart demo to showcase Hudi functionalities using docker along with support for integration-tests #455
@bvaradar - Ensure Hoodie metadata folder and files are filtered out when constructing Parquet Data Source #473
@leletan - Adds HoodieGlobalBloomIndex #438

28 Sep 06:28

vinothchandar

hoodie-0.4.4

5847b61

Release 0.4.4

Highlights

Dependencies are now decoupled from CDH and based on apache versions!
Support for Hive 2 is here!! Use -Dhive11 to build for older hive versions
Deltastreamer tool reworked to make configs simpler, hardened tests, added Confluent Kafka support
Provide strong consistency for S3 datasets
Removed dependency on commons lang3, to ease use with different hadoop/spark versions
Better CLI support and docs for managing async compactions
New CLI commands to manage datasets

Full PR List

@vinothchandar - Perform consistency checks during write finalize #464
@bvaradar - Travis CI tests needs to be run in quieter mode (WARN log level) to avoid max log-size errors #465
@lys0716 - Fix the name of avro schema file in Test #467
@bvaradar - Hive Sync handling must work for datasets with multi-partition keys #460
@bvaradar - Explicitly release resources in LogFileReader and TestHoodieClientBase. Fixes Memory allocation errors #463
@bvaradar - [Release Blocking] Ensure packaging modules create sources/javadoc jars #461
@vinothchandar - Fix bug with incrementally pulling older data #458
@saravsars - Updated jcommander version to fix NPE in HoodieDeltaStreamer tool #443
@n3nash - Removing dependency on apache-commons lang 3, adding necessary classes as needed #444
@n3nash - Small file size handling for inserts into log files. #413
@vinothchandar - Update Gemfile.lock with higher ffi version
@bvaradar - Simplify and fix CLI to schedule and run compactions #447
@n3nash - Fix a test case failing intermittently in TestMergeOnRead due to incorrect prev commit time #448
@bvaradar - CLI to create and desc hoodie table #446
@vinothchandar - Reworking the deltastreamer tool #449
@bvaradar - Docs for describing async compaction and how to operate it #445
@n3nash - Adding check for rolling stats not present in existing timeline to handle backwards compatibility #451
@bvaradar @vinothchandar - Moving all dependencies off cdh and to apache #420
@bvaradar - Reduce minimum delta-commits required for compaction #452
@bvaradar - Use spark Master from environment if set #454

23 Aug 04:59

vinothchandar

hoodie-0.4.3

8d305c5

Release 0.4.3

Highlights

Ability to run compactions asynchronously & in parallel to ingestion/write added!!!
Day-based compaction is now IO agnostic, i.e. it no longer respects IO budgets
Adds ability to throttle writes to HBase via the HBaseIndex
(Merge on read) Inserts are sent to log files, if they are indexable.

Full PR List

@n3nash - Adding ability for inserts to be written to log files #400
@n3nash - Fixing bug introduced in rollback for MOR table type with inserts into log files #417
@n3nash - Changing Day based compaction strategy to be IO agnostic #398
@ovj - Changing access level to protected so that subclasses can access it #421
@n3nash - Fixing missing hoodie record location in HoodieRecord when record is read from disk after being spilled #419
@bvaradar - Async compaction - Single Consolidated PR #404
@bvaradar - BUGFIX - Use Guava Optional (which is Serializable) in CompactionOperation to avoid NoSerializableException #435
@n3nash - Adding another metric to HoodieWriteStat #434
@n3nash - Fixing Null pointer exception in finally block #440
@kaushikd49 - Throttling to limit QPS from HbaseIndex #427

11 Jun 18:27

vinothchandar

hoodie-0.4.2

43ef385

Release 0.4.2

Highlights

Parallelized Parquet writing & input record reading, resulting in up to 2x performance improvement
Better out-of-box configs to support up to 500GB upserts, improved ROPathFilter performance
Added a union mode for RT View, that supports near-real time event ingestion without update semantics
Added a tuning guide with suggestions for oft-encountered problems
New configs for compression ratio, index storage levels

Full PR List

@jianxu - Use hadoopConf in HoodieTableMetaClient and related tests #343
@jianxu - Add more options in HoodieWriteConfig #341
@n3nash - Adding a tool to read/inspect a HoodieLogFile #328
@ovj - Parallelizing parquet write and spark's external read operation. #294
@n3nash - Fixing memory leak due to HoodieLogFileReader holding on to a logblock #346
@kaushikd49 - DeduplicateRecords based on recordKey if global index is used #345
@jianxu - Checking storage level before persisting preppedRecords #358
@n3nash - Adding config for parquet compression ratio #366
@xjodoin - Replace deprecated jackson version #367
@n3nash - Making ExternalSpillableMap generic for any datatype #350
@bvaradar - CodeStyle formatting to conform to basic Checkstyle rules. #360
@vinothchandar - Update release notes for 0.4.1 (post) #371
@bvaradar - Issue-329 : Refactoring TestHoodieClientOnCopyOnWriteStorage and adding test-cases #372
@n3nash - Parallelized read-write operations in Hoodie Merge phase #370
@n3nash - Using BufferedFsInputStream to wrap FSInputStream for FSDataInputStream #373
@suniluber - Fix for updating duplicate records in same/different files in same pa… #380
@bvaradar - Fixit : Add Support for ordering and limiting results in CLI show commands #383
@n3nash - Adding metrics for MOR and COW #365
@n3nash - Adding a fix/workaround when fs.append() unable to return a valid outputstream #388
@n3nash - Minor fixes for MergeOnRead MVP release readiness #387
@bvaradar - Issue-257: Support union mode in HoodieRealtimeRecordReader for pure insert workloads #379
@n3nash - Enabling global index for MOR #389
@suniluber - Added a new filter function to filter by record keys when reading parquet file #395
@vinothchandar - Improving out of box experience for data source #295
@xjodoin - Fix wrong use of TemporaryFolder junit rule #411

03 Oct 07:18

vinothchandar

hoodie-0.4.0

50139fe

Release 0.4.0

Highlights

Spark datasource API now supported for Copy-On-Write datasets, across all views
BloomIndex can now prune based on key ranges & cut down index tagging time dramatically, for time-prefixed/ordered record keys
Hive sync tool registers RO and RT tables now.
Client application can now specify the partitioner to be used by bulkInsert(), useful for low-level control over initial record placement
Framework for metadata tracking inside IO handles, to implement Spark accumulator-style counters, that are consistent with the timeline
Bug fixes around cleaning, savepoints & upsert's partitioner.
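
To make the key-range pruning highlight above concrete, here is a minimal sketch of the idea, assuming each file records the minimum and maximum record key it holds. The function name `candidate_files`, the file ids, and the sample keys are invented for illustration, and the linear scan is a simplification rather than Hudi's implementation. Only files whose range covers the key would still need a bloom filter check.

```python
from typing import Dict, List, Tuple


def candidate_files(key: str, file_ranges: Dict[str, Tuple[str, str]]) -> List[str]:
    """Return ids of files whose [min_key, max_key] range could contain `key`."""
    return sorted(
        file_id
        for file_id, (min_key, max_key) in file_ranges.items()
        if min_key <= key <= max_key
    )


# Hypothetical files holding time-prefixed record keys.
ranges = {
    "file-a": ("2017-09-01/000", "2017-09-10/999"),
    "file-b": ("2017-09-11/000", "2017-09-20/999"),
    "file-c": ("2017-09-18/000", "2017-09-30/999"),
}

# Files that still need a bloom filter lookup for this key.
to_check = candidate_files("2017-09-19/123", ranges)
```

With time-prefixed or otherwise ordered keys, file ranges overlap little, so most files are skipped before any bloom filter is consulted. With random keys every range tends to span the whole key space and the pruning helps far less.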

Full PR List

@gekath - Writes relative paths to .commit files #184
@kaushikd49 - Correct clean bug that causes exception when partitionPaths are empty #202
@vinothchandar - Refactor HoodieTableFileSystemView using FileGroups & FileSlices #201
@prazanna - Savepoint should not create a hole in the commit timeline #207
@jianxu - Fix TimestampBasedKeyGenerator in HoodieDeltaStreamer when DATE_STRING is used #211
@n3nash - Sync Tool registers 2 tables, RO and RT Tables #210
@n3nash - Using FsUtils instead of Files API to extract file extension #213
@vinothchandar - Edits to documentation #219
@n3nash - Enabled deletes in merge_on_read #218
@n3nash - Use HoodieLogFormat for the commit archived log #205
@n3nash - fix for cleaning log files in master branch (mor) #228
@vinothchandar - Adding range based pruning to bloom index #232
@n3nash - Use CompletedFileSystemView instead of CompactedView considering deltacommits too #229
@n3nash - suppressing logs (under 4MB) for jenkins #240
@jianxu - Add nested fields support for MOR tables #234
@n3nash - adding new config to separate shuffle and write parallelism #230
@n3nash - adding ability to read archived files written in log format #252
@ovj - Removing randomization from UpsertPartitioner #253
@ovj - Replacing SortBy with custom partitioner #245
@esmioley - Update deprecated hash function #259
@vinothchandar - Adding canIndexLogFiles(), isImplicitWithStorage(), isGlobal() to HoodieIndex #268
@kaushikd49 - Hoodie Event callbacks #251
@vinothchandar - Spark Data Source (finally) #266

16 Jun 18:04

prazanna

hoodie-0.3.8

45732e4

hoodie-0.3.8 (MOR) MVP

Highlights

Merge on Read tested end to end. Ingestion - Hive Registration - Querying non-nested fields
Contributions from @kaushikd49 @n3nash @dannyhchen @zqureshi @vinothchandar and @prazanna

New Features

#149 Introduce custom log format (HoodieLogFormat) for the log files
#141 Introduce Compaction Strategies for Merge on Read table and implement UnboundedCompactionStrategy and IOBoundedCompactionStrategy
#42 Implement HoodieRealtimeInputFormat and HoodieRealtimeRecordReader
#150 Rewrite hoodie-hive to incrementally sync partitions based on the last commit that was successfully synced
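
The difference between an unbounded and an IO-bounded compaction strategy, as named in the features above, can be sketched roughly as follows. This is not taken from Hudi's code: the `CompactionOp` fields, the IO estimate, the ordering by log size, and the function `select_within_budget` are all assumptions made for this example.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class CompactionOp:
    """Hypothetical unit of compaction work for one file."""

    file_id: str
    log_mb: int   # total size of log files to merge
    base_mb: int  # size of the base file to rewrite

    @property
    def io_mb(self) -> int:
        # Assumed estimate: read the logs, read the base file, write a new base file.
        return self.log_mb + 2 * self.base_mb


def select_within_budget(ops: List[CompactionOp], budget_mb: Optional[int]) -> List[CompactionOp]:
    """Pick operations, largest log first, skipping any that would exceed the budget.

    A budget of None models an unbounded strategy and selects everything.
    """
    ordered = sorted(ops, key=lambda op: op.log_mb, reverse=True)
    if budget_mb is None:
        return ordered
    selected: List[CompactionOp] = []
    spent = 0
    for op in ordered:
        if spent + op.io_mb <= budget_mb:
            selected.append(op)
            spent += op.io_mb
    return selected


ops = [
    CompactionOp("f1", log_mb=100, base_mb=50),
    CompactionOp("f2", log_mb=10, base_mb=100),
    CompactionOp("f3", log_mb=50, base_mb=20),
]
chosen = [op.file_id for op in select_within_budget(ops, budget_mb=300)]
```

Bounding the IO per run keeps each compaction predictable in cost, at the price of leaving some files uncompacted until a later run.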

Changes

#168 - Handle skew in time taken to clean
Updated community committership guidelines
Add GCS support
Add S3 support
Support for viewFS

Commits: 21e334...4b26be
