
[HUDI-6539] New LSM tree style archived timeline #9209

Merged: 1 commit into apache:master on Aug 29, 2023

Conversation

danny0405 (Contributor):

Change Logs

A new LSM style archived timeline.

Impact

none

Risk level (write none, low, medium or high below)

none

Documentation Update

Describe any necessary documentation update if there is any new feature, config, or user-facing change:

  • The config description must be updated if new configs are added or the default value of a config is changed
  • Any new feature or user-facing change requires updating the Hudi website. Please create a Jira ticket, attach the
    ticket number here and follow the instructions to make
    changes to the website.

Contributor's checklist

  • Read through contributor's guide
  • Change Logs and Impact were stated clearly
  • Adequate tests were added if applicable
  • CI passed

vinothchandar self-assigned this on Jul 18, 2023
danny0405 force-pushed the HUDI-6539 branch 3 times, most recently from e88d7dc to 8784e5f on July 20, 2023
danny0405 force-pushed the HUDI-6539 branch 4 times, most recently from 6926ee7 to a71bc4b on July 21, 2023
danny0405 changed the title from [WIP][HUDI-6539] New LSM tree style archived timeline to [HUDI-6539] New LSM tree style archived timeline on Jul 24, 2023
danny0405 force-pushed the HUDI-6539 branch 5 times, most recently from c281ded to a7f8558 on July 25, 2023
vinothchandar (Member) left a comment:

Can you please split the code earlier into an "org.apache.hudi.storage.lsm" package, where we keep the parquet LSM code away from its use in the ArchivedTimeline? I think it'll also help a lot with reviewing the code.

/**
* A combination of instants covering action states: requested, inflight, completed.
*/
public class ActiveInstant implements Serializable, Comparable<ActiveInstant> {
vinothchandar (Member):

rename to "ActiveAction" since its really the instants that make up the action to completed state?

danny0405 (Contributor Author) on Jul 31, 2023:

It's just a triple of an instant with 3 different states; maybe we can come up with a better name for it.

vinothchandar (Member):

Let's call this class ActiveAction, which is a triplet of instants. That's what we call this today.

danny0405 (Contributor Author):

+1 for ActiveAction.

}

/**
* A COMPACTION action eventually becomes COMMIT when completed.
vinothchandar (Member):

Something to think about: whether we keep it "COMPACTION" with the new changes.

danny0405 (Contributor Author):

Personally I'm +1 for keeping the compaction and log_compaction actions just as they are; this avoids many ambiguities. But I think it should be a separate topic: we need a discussion on whether to reuse the action for all kinds of table services: compaction, log_compaction, clustering, etc.

vinothchandar (Member):

+1 we can decouple this.

*
* <p><h2>The LSM Tree Compaction</h2>
* Use the universal compaction strategy, that is: when N (by default 10) parquet files exist in the current layer, they are merged and flushed as a larger file in the next layer.
* We have no limit on the layer number. Assuming there are 10 instants for each file in L0, there could be 100 instants per file in L1,
vinothchandar (Member):

Might be good to think about a bound here and control how the LSM merge behaves based on that.
I suggest not having files larger than 1GB, to ensure the merge process can run on lower-end VMs/machines.

danny0405 (Contributor Author):

Yeah, size-based compaction makes sense to me. We can always pick the oldest files from a layer but also keep the total size of the source files under 1GB.

vinothchandar (Member) on Aug 1, 2023:

Size makes sense. Do we need the size info in the manifest as well, then?

vinothchandar (Member) on Aug 1, 2023:

Should we use a schemaful format to store the manifests: JSON/Parquet/Avro? Can we get all the information we need to plan LSM compaction/cleaning into the manifest, so that we are not listing anything?

danny0405 (Contributor Author):

Finally, we encode the manifest content as a JSON string, which is friendlier to extend in the future if there are new requirements. Also, the file size is encoded along with the file name.
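For illustration only, a manifest payload under this scheme might look roughly like the following; the field names here are assumptions for this sketch, not taken from the PR, while the file naming convention (minInstant_maxInstant_layer.parquet) matches the newFileName helper quoted later in this thread:

{
  "files": [
    {"fileName": "00000001_00000010_0.parquet", "fileLen": 1048576},
    {"fileName": "00000011_00000020_0.parquet", "fileLen": 986452}
  ]
}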

*
* <p>In order to provide snapshot isolation for archived timeline writes/reads, we add two kinds of metadata files for the LSM tree version management:
* <ol>
* <li>Manifest file: Each new file in layer 0 or each compaction generates a new manifest file; the manifest file records the valid file handles of the latest snapshot;</li>
vinothchandar (Member):

I.e., the list of all files in the entire LSM, correct?

danny0405 (Contributor Author) on Jul 31, 2023:

Correct; to be more accurate, it keeps the list of all the files in the latest snapshot. The LSM tree itself has multiple versions.

* <p>In order to provide snapshot isolation for archived timeline writes/reads, we add two kinds of metadata files for the LSM tree version management:
* <ol>
* <li>Manifest file: Each new file in layer 0 or each compaction generates a new manifest file; the manifest file records the valid file handles of the latest snapshot;</li>
* <li>Version file: A version file is generated right after a complete manifest file is formed.</li>
vinothchandar (Member):

What does this contain? A pointer to the latest manifest file?

danny0405 (Contributor Author):

It behaves like a MARKER file for the manifest file: a version file indicates that the snapshot version is now complete for the reader view. The reader lists all the valid version files to fetch the valid versions of the current timeline.

vinothchandar (Member):

To understand this better:

  • For a distributed file system like HDFS: does this prevent the reader from reading a partially written MANIFEST file?
  • For cloud storage: PUTs are atomic, so the reader will not see a partially written MANIFEST file?

danny0405 (Contributor Author):

Yes, but is there any possibility the reader sees an empty MANIFEST file? Anyway, we now always write a temp file first, then rename it to the final file.

danny0405 (Contributor Author):

Update:

  1. We now have only one version hint file, whose content is the latest version number.
  2. For HDFS, we do a rename from a temp file to the final file, for both the manifest and the version file.
  3. For S3, we do a direct write, because the object storage write operation is itself atomic. (A sketch follows below.)
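A rough sketch of that write protocol; the class and method names here are hypothetical, and only the standard Hadoop FileSystem API is used:

import java.io.IOException;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class MetaFileWriteSketch {
  // Writes a manifest/version file so that readers never observe a partial file.
  public static void writeMetaFile(FileSystem fs, Path finalPath, byte[] content, boolean useTempFile) throws IOException {
    if (useTempFile) {
      // HDFS-style file systems: write a temp file first, then rename.
      // Rename is atomic on HDFS, so readers see either the old file or the new one.
      Path tmpPath = new Path(finalPath.getParent(), finalPath.getName() + ".tmp");
      try (FSDataOutputStream out = fs.create(tmpPath, true)) {
        out.write(content);
      }
      if (!fs.rename(tmpPath, finalPath)) {
        throw new IOException("Failed to rename " + tmpPath + " to " + finalPath);
      }
    } else {
      // Object stores such as S3: a single PUT is atomic, so write directly.
      try (FSDataOutputStream out = fs.create(finalPath, true)) {
        out.write(content);
      }
    }
  }
}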

* </ul>
*
* <p><h2>The Legacy Files Cleaning and Read Retention</h2>
* Only triggers file cleaning after a valid compaction.
vinothchandar (Member):

I think we can use OCC here for concurrency control between the LSM merge and the writer? Even just taking a lock and letting the write or LSM merge fail if there was something concurrent would be OK?

danny0405 (Contributor Author):

For multi-writer, we already have an explicit lock guard for the archiver. The LSM merge is now an inline action right after the write; we might need to support async compaction in the future? I have no idea yet.

vinothchandar (Member):

Yeah, for now we assume this is within a lock and done inline.

vinothchandar (Member):

File a JIRA to track?


* Keeps only 3 valid snapshot versions for the reader, which means a file is kept for at least 3 archival trigger intervals. With the default configuration that is a 30-instant time span,
* which is far longer than the archived timeline loading time.
*
* <p><h3>Instants TTL</h3></p>
vinothchandar (Member):

I'd prefer lazily loading it vs. ignoring it completely, i.e. the LSM performs well if the read is within 1 week in the past, but it should always be correct.

danny0405 (Contributor Author):

+1 too. By default we can eagerly load 3 ~ 7 days of instants into memory for fast lookup of the completion time; if an instant is out of this range, do a lazy load for it. The performance should not be too bad because we have data skipping on the files.
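A rough sketch of that eager-window-plus-lazy-fallback lookup; the class and method names here are hypothetical:

import java.util.HashMap;
import java.util.Map;
import org.apache.hudi.common.util.Option;

public class CompletionTimeLookupSketch {
  // Completion times for instants inside the eagerly loaded window (e.g. the last 3 ~ 7 days).
  private final Map<String, String> eagerCache = new HashMap<>();

  public Option<String> getCompletionTime(String instantTime) {
    String completionTime = eagerCache.get(instantTime);
    if (completionTime != null) {
      return Option.of(completionTime);
    }
    // Outside the eager window: lazily load only the LSM files whose
    // [minInstant, maxInstant] name range covers this instant (data skipping).
    return lazyLoadFromArchivedTimeline(instantTime);
  }

  private Option<String> lazyLoadFromArchivedTimeline(String instantTime) {
    // Placeholder for the lazy read path against the archived timeline.
    return Option.empty();
  }
}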

* <p><h2>The Legacy Files Cleaning and Read Retention</h2>
* Only triggers file cleaning after a valid compaction.
*
* <p><h3>Clean Strategy</h3></p>
vinothchandar (Member):

Not following. Are you talking about cleaning of the Hudi data table?

danny0405 (Contributor Author):

No, the cleaning of the LSM tree legacy files itself.

vinothchandar (Member):

Let's include the info we need, e.g. the timestamp of when a version was created, to help retain x hours of versions.

danny0405 (Contributor Author) on Aug 1, 2023:

Either is okay. Version-number-based cleaning works better when the timeline is committed more frequently; because we do not need time-travel queries on the timeline, the cleaning can be more aggressive.
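A rough sketch of version-number-based cleaning (names hypothetical): retain the files referenced by the latest N snapshot versions, as described earlier in the thread, and treat everything else as legacy files.

import java.util.Comparator;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Function;
import java.util.stream.Collectors;

public class LsmCleanSketch {
  private static final int VERSIONS_TO_RETAIN = 3;

  // allVersions: version numbers of all snapshots;
  // manifestFiles: reads the file list from a given version's manifest.
  public static Set<String> filesToDelete(List<Integer> allVersions,
      Function<Integer, Set<String>> manifestFiles, Set<String> allFiles) {
    Set<String> referenced = allVersions.stream()
        .sorted(Comparator.reverseOrder())
        .limit(VERSIONS_TO_RETAIN)
        .map(manifestFiles)
        .flatMap(Set::stream)
        .collect(Collectors.toSet());
    // Any file not referenced by a retained snapshot can be cleaned.
    Set<String> toDelete = new HashSet<>(allFiles);
    toDelete.removeAll(referenced);
    return toDelete;
  }
}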

danny0405 (Contributor Author), replying to the package-split suggestion above:

Moved the LSM write path code into ArchivedTimelineWriter.

danny0405 force-pushed the HUDI-6539 branch 3 times, most recently from 4ade37c to 57c1b84 on August 3, 2023
vinothchandar (Member) left a comment:

Looks very promising; I like the overall direction. Left tons of naming and layering comments and pointed out some gaps.

Could you please resolve the naming and code-structure ones, and comment on how we can make this more full-fledged, especially the compaction implementation.
Please resolve the comments that are addressed or that we have aligned on, so we can track the pending items easily.

* @param path the file to be deleted.
* @return IOException caught, if any. Null otherwise.
*/
private static IOException tryDelete(FileSystem fs, Path path) {
vinothchandar (Member):

please use Option

danny0405 (Contributor Author):

Removed.

* Avro schema for different archived instant read cases.
*/
public abstract class ArchivedInstantReadSchemas {
public static final Schema SLIM_SCHEMA = new Schema.Parser().parse("{\n"
vinothchandar (Member):

rename: TIMELINE_LSM_SLIM_READ_SCHEMA

import java.util
import scala.collection.JavaConverters._

object ArchivedTimelineReadBenchmark extends HoodieBenchmarkBase {
vinothchandar (Member):

Is this a JMH benchmark?

danny0405 (Contributor Author):

yes

private final int maxInstantsToKeep;
private final int minInstantsToKeep;
private final HoodieTable<T, I, K, O> table;
private final HoodieTableMetaClient metaClient;
private final TransactionManager txnManager;

private final ArchivedTimelineWriter timelineWriter;
vinothchandar (Member):

archivedTimelineWriter?
Rename?

} else {
LOG.info("No Instants to archive");
}

if (shouldMergeSmallArchiveFiles()) {
vinothchandar (Member):

Can you confirm we are removing this functionality fully within this pull request?

danny0405 (Contributor Author):

Kind of: it is replaced by the new compaction, but we still keep one config option, namely the number of files batched for each compaction input source.

}
}

private Map<String, Boolean> deleteFilesParallelize(
vinothchandar (Member):

move to FSUtils?

* <p>
* </p>
* This class can be serialized and de-serialized and on de-serialization the FileSystem is re-initialized.
* Represents the Archived Timeline for the Hoodie table.
vinothchandar (Member):

Please keep this class free from the LSM design. Single Responsibility Principle.

* We have no limit on the layer number. Assuming there are 10 instants for each file in L0, there could be 100 instants per file in L1,
* so 3000 instants could be represented as 3 parquet files in L2; it is pretty fast if we use concurrent reads.
*
* <p>The benchmark shows that reading 1000 instants costs about 10 ms.
vinothchandar (Member):

move all this out to LSMTimeline class.

danny0405 (Contributor Author):

done

private List<String> getCandidateFiles(List<HoodieArchivedManifest.FileEntry> files, int filesBatch) throws IOException {
List<String> candidates = new ArrayList<>();
long totalFileLen = 0L;
long maxFileSizeInBytes = 1024 * 1024 * 1000;
vinothchandar (Member):

pull into constant
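A minimal version of the suggestion (the constant name is assumed), keeping the value from the snippet above:

private static final long MAX_FILE_SIZE_IN_BYTES = 1024 * 1024 * 1000;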

public void compactAndClean(HoodieEngineContext context) throws IOException {
// 1. List all the latest snapshot files
HoodieArchivedManifest latestManifest = HoodieArchivedTimeline.latestSnapshotManifest(metaClient);
int layer = 0;
vinothchandar (Member):

so we don't compact beyond layer 0 now?

danny0405 (Contributor Author):

We have no limit on the number of layers.

vinothchandar (Member) left a comment:

Still a few places to clean up, but LGTM overall. Please revert the one change that seems unrelated (or let me know if it's related).

},
{
"name":"plan",
"type":["null", "bytes"],
vinothchandar (Member):

I am thinking of the scenario where we want users to write SQL to query the timeline. If we use bytes, we probably need to provide UDFs for converting the bytes to a plan? Follow-up JIRA? (I think this is still better than a nested schema, which can be expensive to write.)

danny0405 (Contributor Author):

Yeah, bytes is more efficient. Filed a follow-up JIRA: https://issues.apache.org/jira/browse/HUDI-6747
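Not part of this PR, but a rough sketch of the kind of UDF the follow-up could add; the UDF name and payload choice are assumptions, using Hudi's existing TimelineMetadataUtils helper for Avro-serialized metadata:

import org.apache.hudi.avro.model.HoodieCompactionPlan;
import org.apache.hudi.common.table.timeline.TimelineMetadataUtils;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.api.java.UDF1;
import org.apache.spark.sql.types.DataTypes;

public class PlanUdfSketch {
  public static void register(SparkSession spark) {
    // Renders the Avro-serialized plan bytes as a JSON-ish string for SQL users.
    spark.udf().register("compaction_plan_json", (UDF1<byte[], String>) bytes ->
        bytes == null ? null : TimelineMetadataUtils.deserializeAvroMetadata(bytes, HoodieCompactionPlan.class).toString(),
        DataTypes.StringType);
  }
}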

private Option<String> getMetadataKey(String action) {
switch (action) {
case HoodieTimeline.CLEAN_ACTION:
return Option.of("hoodieCleanMetadata");
vinothchandar (Member):

any existing constants we can use for this? else ignore.

import java.util.stream.Collectors;

/**
* An archived timeline writer which organizes the files as an LSM tree.
vinothchandar (Member):

remove "archived"

* limitations under the License.
*/

package org.apache.hudi.client.utils;
vinothchandar (Member):

Can we make an org.apache.hudi.client.timeline package and move all these classes there?

danny0405 (Contributor Author):

done

LOG.info("Writing schema " + wrapperSchema.toString());
for (ActiveAction activeAction : activeActions) {
try {
if (preWriteCallback != null) {
vinothchandar (Member):

can we try to use Option instead of null as sentinels?
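For what it's worth, the merged API does take an Option callback (the timelineWriter.write call later in this thread passes Option.of(...)). A minimal sketch of the null-free style, assuming the field becomes Hudi's Option<Consumer<ActiveAction>>:

// private final Option<Consumer<ActiveAction>> preWriteCallback;
preWriteCallback.ifPresent(callback -> callback.accept(activeAction));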

* Returns a new file name.
*/
private static String newFileName(String minInstant, String maxInstant, int layer) {
return minInstant + "_" + maxInstant + "_" + layer + HoodieFileFormat.PARQUET.getFileExtension();
vinothchandar (Member):

String.format?
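A minimal sketch of the suggestion, producing the same file name as the concatenation above:

return String.format("%s_%s_%d%s", minInstant, maxInstant, layer, HoodieFileFormat.PARQUET.getFileExtension());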

if (!completedCommitBeforeOldestPendingInstant.isPresent()
|| HoodieTimeline.compareTimestamps(oldestPendingInstant.get().getTimestamp(),
LESSER_THAN, completedCommitBeforeOldestPendingInstant.get().getTimestamp())) {
if (!completedCommitBeforeOldestPendingInstant.isPresent()) {
vinothchandar (Member):

Why do we need this change? It may break something; please revert if it is unnecessary for this change.

danny0405 (Contributor Author):

Because the check

HoodieTimeline.compareTimestamps(oldestPendingInstant.get().getTimestamp(),
          LESSER_THAN, completedCommitBeforeOldestPendingInstant.get().getTimestamp())

is always false. I can revert the change.

"type":"record",
"name":"HoodieLSMTimelineInstant",
"namespace":"org.apache.hudi.avro.model",
"fields":[
vinothchandar (Member):

Let's add a version field to each record, so we can evolve it as we go if needed.

/**
* Parse the maximum instant time from the file name.
*/
public static String getMaxInstantTime(String fileName) {
vinothchandar (Member):

do these methods have UTs?

danny0405 (Contributor Author):

Tests added.

/**
* Parse the minimum instant time from the file name.
*/
public static String getMinInstantTime(String fileName) {
vinothchandar (Member):

Instead of individual parsing methods, can we introduce a POJO here, with getters? I.e.

an LSMFile class with min, max, level as fields?

danny0405 (Contributor Author):

Didn't see much gain here; we can extend it in the near future if we have more complex parsing of the file names.
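For reference, a sketch of the POJO idea, based on the naming scheme from the newFileName helper above (minInstant + "_" + maxInstant + "_" + layer + ".parquet"); the class name is the reviewer's suggestion and the parsing details are assumptions:

public class LSMFile {
  private final String minInstant;
  private final String maxInstant;
  private final int level;

  private LSMFile(String minInstant, String maxInstant, int level) {
    this.minInstant = minInstant;
    this.maxInstant = maxInstant;
    this.level = level;
  }

  // Parses a file name of the form "minInstant_maxInstant_level.parquet".
  public static LSMFile fromFileName(String fileName) {
    String basename = fileName.substring(0, fileName.lastIndexOf('.'));
    String[] parts = basename.split("_");
    return new LSMFile(parts[0], parts[1], Integer.parseInt(parts[2]));
  }

  public String getMinInstant() {
    return minInstant;
  }

  public String getMaxInstant() {
    return maxInstant;
  }

  public int getLevel() {
    return level;
  }
}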

danny0405 force-pushed the HUDI-6539 branch 4 times, most recently from 613a47b to 1739fa3 on August 25, 2023
* Maintain only one version pointer file, add a file size limit to the compaction strategy
* Write the manifest as JSON, move the timeline write path to a separate class for convenient review
hudi-bot:

CI report:

Bot commands: @hudi-bot supports the following commands:
  • @hudi-bot run azure: re-run the last Azure build

danny0405 merged commit d924f18 into apache:master on Aug 29, 2023. 27 checks passed.
leosanqing pushed a commit to leosanqing/hudi that referenced this pull request on Sep 13, 2023
* Replace the log-based archived timeline with the new parquet-based timeline
* The timeline is organized as an LSM tree; it has multiple versions for read/write snapshot isolation
* Maintain only one version pointer file, add a file size limit to the compaction strategy
* Write the manifest as JSON
waywtdcc (Contributor):

Hello, does the master branch now support LSM format merge? @danny0405

danny0405 (Contributor Author), replying:

No, only the archived timeline uses the LSM layout for instants access.

throw new HoodieException(e);
}
};
this.timelineWriter.write(instantsToArchive, Option.of(action -> deleteAnyLeftOverMarkers(context, action)), Option.of(exceptionHandler));
LOG.info("Deleting archived instants " + instantsToArchive);
success = deleteArchivedInstants(instantsToArchive, context);
Contributor:

Do we need to consider how to handle a deletion failure exception here?

danny0405 (Contributor Author) on Nov 15, 2023:

We should, but currently the leftovers from a failed deletion do not affect correctness: the completion time is still loaded correctly if an instant is present on both the active and archived timelines.

We need to think through the design though; it is arduous to make the whole multi-step handling atomic.

Contributor:

No more doubts.

Projects
Status: ✅ Done
5 participants