Add signatures for exactly once sink #151
Conversation
@clmccart kindly review this PR. Thanks!

// Used for Flink metrics.
private long totalRecordsWritten;
Could we also add totalRecordsSeen?
Good catch. Done.
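A minimal sketch of how the two counters discussed above might be tracked in a writer (class, field, and method names here are illustrative assumptions, not the connector's actual code):

```java
// Hypothetical sketch: per-writer record counters backing Flink metrics.
// Names and structure are assumptions for illustration only.
public class WriterMetricsSketch {

    // Records received by the writer, whether or not written to BigQuery yet.
    private long totalRecordsSeen;

    // Records successfully written to BigQuery via append requests.
    private long totalRecordsWritten;

    // Called once per incoming element.
    public void onRecordReceived() {
        totalRecordsSeen++;
    }

    // Called when an append request of the given size succeeds.
    public void onAppendSuccess(long recordCount) {
        totalRecordsWritten += recordCount;
    }

    public long getTotalRecordsSeen() {
        return totalRecordsSeen;
    }

    public long getTotalRecordsWritten() {
        return totalRecordsWritten;
    }
}
```

Tracking both lets the gap between records seen and records written surface as a metric, which is useful for spotting buffered-but-uncommitted data.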
/**
 * Sink to write data into a BigQuery table using {@link BigQueryBufferedWriter}.
 *
 * <p>Depending on the checkpointing mode, this writer offers following consistency guarantees:
If we can just use this writer with checkpointing in ALO mode, does this mean that the BigQueryDefaultSink is redundant now?
Great question, and the answer is no. The fundamental difference is the write streams they use. ALO's writer uses the default stream (data is instantly written to the BQ table), while this writer first buffers data and then requires an explicit commit (BQ's FlushRows API) before data is written to the table. Hence, our ALO solution ensures data is instantly available in BQ at the cost of duplicates, while the exactly-once solution ensures data consistency at the cost of availability (data is written to BQ only at the Flink application's checkpoints).
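The visibility difference described above can be sketched with a toy model (pure illustration of the semantics; the real behavior goes through the BigQuery Storage Write API's default and buffered write streams, not in-memory lists):

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of default-stream vs buffered-stream visibility; not connector code.
public class StreamSemanticsSketch {

    // Default stream: rows become visible in the table immediately on append.
    static class DefaultStream {
        final List<String> table = new ArrayList<>();

        void append(String row) {
            table.add(row);
        }
    }

    // Buffered stream: rows sit in a buffer until an explicit flush (FlushRows),
    // which the exactly-once sink would issue at Flink checkpoints.
    static class BufferedStream {
        final List<String> table = new ArrayList<>();
        private final List<String> buffer = new ArrayList<>();

        void append(String row) {
            buffer.add(row);
        }

        // Models the explicit commit: only now do buffered rows reach the table.
        void flushRows() {
            table.addAll(buffer);
            buffer.clear();
        }
    }
}
```

In this model, a failure before flushRows() simply discards the buffer, which is why the buffered approach avoids duplicates at the cost of delayed availability.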
 * Sink to write data into a BigQuery table using {@link BigQueryBufferedWriter}.
 *
 * <p>Depending on the checkpointing mode, this writer offers following consistency guarantees:
 * <li>{@link CheckpointingMode#EXACTLY_ONCE}: exactly-once write consistency.
(here and below) should this be CheckpointingMode#EXACTLY_ONCE or DeliveryGuarantee#EXACTLY_ONCE?
CheckpointingMode
DeliveryGuarantee controls which BQ sink is used, ALO or exactly once. CheckpointingMode controls Flink's internal consistency handling. There is semantic overlap between the two, but the user needs to set both at separate stages. So, we have documented the outcome for different combinations.
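The two-knob setup can be sketched as a small decision table (a toy model only; the enum values mirror Flink's DeliveryGuarantee and CheckpointingMode, but the sink name and the combination outcomes below are illustrative assumptions, not the connector's documented matrix):

```java
// Toy sketch of how two separately configured knobs select behavior.
// Outcomes and the exactly-once sink's name are assumptions for illustration.
enum DeliveryGuaranteeSketch { AT_LEAST_ONCE, EXACTLY_ONCE }

enum CheckpointingModeSketch { AT_LEAST_ONCE, EXACTLY_ONCE }

public class GuaranteeSelectionSketch {

    // DeliveryGuarantee picks which sink implementation is constructed.
    static String sinkFor(DeliveryGuaranteeSketch g) {
        return g == DeliveryGuaranteeSketch.EXACTLY_ONCE
                ? "BigQueryExactlyOnceSink" // hypothetical name
                : "BigQueryDefaultSink";
    }

    // CheckpointingMode governs Flink's internal consistency; only when both
    // knobs are exactly-once does the pipeline end-to-end reach exactly-once
    // (an assumption modeling the "outcome for different combinations" note).
    static String overallGuarantee(DeliveryGuaranteeSketch g, CheckpointingModeSketch m) {
        if (g == DeliveryGuaranteeSketch.EXACTLY_ONCE
                && m == CheckpointingModeSketch.EXACTLY_ONCE) {
            return "exactly-once";
        }
        return "at-least-once";
    }
}
```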
package com.google.cloud.flink.bigquery.sink.throttle;

/** Limits the rate at which an operation can be performed. */
Can we document why this is needed? I assume it's because the Storage Write API doesn't have built-in throttling?
This will be documented in a future PR: https://github.com/jayehwhyehentee/flink-bigquery-connector/blob/exo_writer/flink-1.17-connector-bigquery/flink-connector-bigquery/src/main/java/com/google/cloud/flink/bigquery/sink/throttle/WriteStreamCreationThrottler.java#L25-L38.
Also, we will add this to the root README.
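A minimal sketch of what such a throttler might look like, assuming a simple fixed-window counting scheme (the actual WriteStreamCreationThrottler linked above may work differently):

```java
// Illustrative throttler: allows at most maxOpsPerWindow operations per time
// window, blocking callers otherwise. Not the connector's actual implementation.
public class SimpleThrottler {

    private final int maxOpsPerWindow;
    private final long windowMillis;
    private long windowStart;
    private int opsInWindow;

    public SimpleThrottler(int maxOpsPerWindow, long windowMillis) {
        this.maxOpsPerWindow = maxOpsPerWindow;
        this.windowMillis = windowMillis;
        this.windowStart = System.currentTimeMillis();
    }

    // Blocks the caller until the operation is allowed to proceed.
    public synchronized void throttle() {
        long now = System.currentTimeMillis();
        // Start a fresh window if the current one has elapsed.
        if (now - windowStart >= windowMillis) {
            windowStart = now;
            opsInWindow = 0;
        }
        // Window is full: sleep out the remainder, then reset.
        if (opsInWindow >= maxOpsPerWindow) {
            long sleep = windowMillis - (now - windowStart);
            if (sleep > 0) {
                try {
                    Thread.sleep(sleep);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            }
            windowStart = System.currentTimeMillis();
            opsInWindow = 0;
        }
        opsInWindow++;
    }
}
```

Calling throttle() before each CreateWriteStream request would cap the stream-creation rate client-side, which matters when the server API itself imposes quota but no built-in pacing.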
@@ -40,8 +40,7 @@
  *
  * <p>Records are grouped to maximally utilize the BigQuery append request's payload.
  *
- * <p>Depending on the checkpointing mode, this writer offers either at-least-once or at-most-once
- * consistency guarantee.
+ * <p>Depending on the checkpointing mode, this writer offers following consistency guarantees:
nit: "the following"
Done
Interfaces and classes that define the structure of the exactly-once sink feature.
/gcbrun