Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add signatures for exactly once sink #151

Merged
merged 3 commits into from
Sep 13, 2024

Conversation

jayehwhyehentee
Copy link
Collaborator

Interfaces and classes that define the structure of exactly-once sink feature.

/gcbrun

@jayehwhyehentee
Copy link
Collaborator Author

@clmccart kindly review this PR. Thanks!


// Used for Flink metrics.
private long totalRecordsWritten;

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we also add totalRecordsSeen?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch. Done.

/**
* Sink to write data into a BigQuery table using {@link BigQueryBufferedWriter}.
*
* <p>Depending on the checkpointing mode, this writer offers following consistency guarantees:

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if we can just use this writer with checkpointing in ALO mode, does this mean that the BigQueryDefaultSink is redundant now?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great question, and the answer is no. Fundamental critical difference is the write streams they use. ALO's writer uses default stream (data instantly written to BQ table), while this writer first buffers data and then requires an explicit commit (BQ's FlushRows API) so that data is written to the table. Hence, our ALO solution ensures data is instantly available in BQ at the cost of duplicates, while exactly once solution ensures data consistency but at the cost of availability (data is written to BQ only at Flink application's checkpoints).

* Sink to write data into a BigQuery table using {@link BigQueryBufferedWriter}.
*
* <p>Depending on the checkpointing mode, this writer offers following consistency guarantees:
* <li>{@link CheckpointingMode#EXACTLY_ONCE}: exactly-once write consistency.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(here and below) should this be CheckpointingMode#EXACTLY_ONCE or DeliveryGuarantee#EXACTLY_ONCE?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

CheckpointingMode

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

DeliveryGuarantee controls which BQ sink is used, ALO or exactly once. CheckpointingMode controls Flink's internal consistency handling. There is semantic overlap between the two, but the user needs to set both at separate stages. So, we have documented the outcome for different combinations.


package com.google.cloud.flink.bigquery.sink.throttle;

/** Limits the rate at which an operation can be performed. */

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we document why this is needed? i assume it's because the storage write api doesnt have built in throttling?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@@ -40,8 +40,7 @@
*
* <p>Records are grouped to maximally utilize the BigQuery append request's payload.
*
* <p>Depending on the checkpointing mode, this writer offers either at-least-once or at-most-once
* consistency guarantee.
* <p>Depending on the checkpointing mode, this writer offers following consistency guarantees:

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: "the following"

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

@jayehwhyehentee jayehwhyehentee merged commit 83226c5 into GoogleCloudDataproc:main Sep 13, 2024
4 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants