Source versioning: Postgres, MySQL and Load generator #647

Open
bobbyiliev wants to merge 19 commits into main from source-versioning

Conversation

@bobbyiliev (Contributor) commented Sep 3, 2024

Initial implementation for the source versioning refactor as per #646

The main changes to consider:

  • Marking the `table` attribute as optional and deprecated for both the MySQL and Postgres sources
  • Introducing a new `all_tables` bool attribute for the MySQL and load generator sources. Previously, the load generator sources (auction, marketing, tpch) always defaulted to `FOR ALL TABLES`, and MySQL sources defaulted to `FOR ALL TABLES` whenever no `table` blocks were defined. The new `all_tables` attribute lets us create sources without any tables defined, as per the source versioning work
  • Introducing the new `materialize_source_table_{mysql|postgres|load_generator}` resources, which allow us to do `CREATE TABLE ... FROM SOURCE ...` (see the sketch after this list)
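
A rough sketch of how these pieces fit together (the resource names come from this PR, but the individual attributes, such as the `source` block and the `upstream_*` fields, are assumptions and may not match the final schema):

```hcl
# Illustrative sketch only; connection blocks and other required
# attributes are omitted, and most attribute names are assumptions.

# A MySQL source created without any inline table definitions.
resource "materialize_source_mysql" "example" {
  name = "mysql_source"

  # New attribute from this PR: when true, the source is created
  # FOR ALL TABLES; when false, no tables are created up front.
  all_tables = false
}

# New resource from this PR: CREATE TABLE ... FROM SOURCE ... for one table.
resource "materialize_source_table_mysql" "orders" {
  name          = "orders"
  schema_name   = "public"
  database_name = "materialize"

  # Reference to the owning source (the shape of this reference is assumed).
  source {
    name = materialize_source_mysql.example.name
  }

  # Upstream MySQL table to ingest (hypothetical names).
  upstream_name        = "orders"
  upstream_schema_name = "shop"
}
```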

Things that are still pending are tracked in #646.

@bobbyiliev changed the title from "Source versioning" to "[WIP] Source versioning" on Sep 3, 2024
@bobbyiliev force-pushed the source-versioning branch 3 times, most recently from c4a61c8 to e202851 on September 9, 2024 13:33
@@ -29,7 +29,7 @@ description: |-
- `ownership_role` (String) The ownership role of the object.
- `region` (String) The region to use for the resource connection. If not set, the default region is used.
- `schema_name` (String) The identifier for the table schema in Materialize. Defaults to `public`.
- `text_columns` (List of String) Columns to be decoded as text.
- `text_columns` (List of String) Columns to be decoded as text. Not supported for load generator sources; if the source is a load generator, this attribute will be ignored.
Contributor:

Similar to sources, we might want a source table resource per source type? The existing source-level options will basically shift to source table-level.

Contributor Author (@bobbyiliev):

Yes indeed, I was just thinking about this. With the MySQL and Postgres sources, it is probably fine, but as soon as we add Kafka and Webhook sources, the logic will get out of hand.

Will refactor this to have a separate source table resource per source!

@bobbyiliev changed the title from "[WIP] Source versioning" to "Source versioning: Postgres, MySQL and Load generator" on Sep 13, 2024
@bobbyiliev marked this pull request as ready for review on September 13, 2024 12:29
@bobbyiliev requested a review from a team as a code owner on September 13, 2024 12:29
@bobbyiliev requested reviews from arusahni and rjobanp and removed the request for a team on September 13, 2024 12:29
@rjobanp left a comment:

nice work!

Comment on lines 76 to 77
- `start_offset` (List of Number, Deprecated) Read partitions from the specified offset. Deprecated: Use the new materialize_source_table_kafka resource instead.
- `start_timestamp` (Number, Deprecated) Use the specified value to set `START OFFSET` based on the Kafka timestamp. Deprecated: Use the new materialize_source_table_kafka resource instead.

these two options are still currently only possible on the top-level CREATE SOURCE statement for kafka sources -- not yet on a per-table basis. It will require a non-trivial amount of additional refactoring to allow them on a per-table basis, so I'm unsure if we will do that work until it's requested by a customer

Contributor Author (@bobbyiliev):

Ah yes! Good catch! Thank you!

@@ -53,12 +53,12 @@ resource "materialize_source_mysql" "test" {
- `comment` (String) **Public Preview** Comment on an object in the database.
- `database_name` (String) The identifier for the source database in Materialize. Defaults to `MZ_DATABASE` environment variable if set or `materialize` if environment variable is not set.
- `expose_progress` (Block List, Max: 1) The name of the progress collection for the source. If this is not specified, the collection will be named `<src_name>_progress`. (see [below for nested schema](#nestedblock--expose_progress))
- `ignore_columns` (List of String, Deprecated) Ignore specific columns when reading data from MySQL. Can only be updated in place when also updating a corresponding `table` attribute. Deprecated: Use the new materialize_source_table resource instead.
- `ignore_columns` (List of String, Deprecated) Ignore specific columns when reading data from MySQL. Can only be updated in place when also updating a corresponding `table` attribute. Deprecated: Use the new materialize_source_table_mysql resource instead.

fyi this option is also being renamed in MaterializeInc/materialize#29438, but the old name will be aliased to the new one, so this shouldn't break

Contributor Author (@bobbyiliev):

Sounds good! I will go ahead and use the exclude columns for the new table source resource!
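
For illustration only, if the renamed option ends up being exposed as an `exclude_columns` attribute on the new table resource (a hypothetical name, not confirmed here), usage might look like:

```hcl
resource "materialize_source_table_mysql" "orders" {
  name = "orders"
  # other attributes omitted

  # Hypothetical attribute name; the final name depends on how the rename in
  # MaterializeInc/materialize#29438 is surfaced in the provider.
  exclude_columns = ["deprecated_col"]
}
```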

Comment on lines 45 to 46
- `start_offset` (List of Number) Read partitions from the specified offset.
- `start_timestamp` (Number) Use the specified value to set `START OFFSET` based on the Kafka timestamp.

these aren't currently available on a per-table basis for kafka sources

- `schema_name` (String) The identifier for the source schema in Materialize. Defaults to `public`.
- `start_offset` (List of Number) Read partitions from the specified offset.
- `start_timestamp` (Number) Use the specified value to set `START OFFSET` based on the Kafka timestamp.
- `upstream_schema_name` (String) The schema of the table in the upstream database.

what does this refer to for kafka sources? we might just want to omit it since the upstream reference should just be the kafka topic name

Contributor Author (@bobbyiliev):

Good catch, this was an oversight on my end in the schema for the Kafka source table resource.

Comment on lines 26 to 27
startOffset []int
startTimestamp int

these two aren't used below and also aren't possible on the statement


This guide will walk you through the process of migrating your existing source table definitions to the new `materialize_source_table_{source}` resource.

For each source type (e.g., MySQL, Postgres, etc.), you will need to create a new `materialize_source_table_{source}` resource for each table that was previously defined within the source resource. This ensures that the tables are preserved during the migration process.

Suggested change
For each source type (e.g., MySQL, Postgres, etc.), you will need to create a new `materialize_source_table_{source}` resource for each table that was previously defined within the source resource. This ensures that the tables are preserved during the migration process.
For each source type (e.g., MySQL, Postgres, etc.), you will need to create a new `materialize_source_table_{source}` resource for each table that was previously defined within the source resource. This ensures that the tables are preserved during the migration process. For Kafka sources, you will need to create at least one `materialize_source_table_kafka` table to hold data for the kafka topic.

@morsapaes might have better wording for this but I think we should be clear that this migration needs to happen for sources that previously didn't have subsources too (e.g. kafka)


The same approach can be used for other source types such as Postgres, e.g. `materialize_source_table_postgres`.
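
To make the migration concrete, here is a minimal before/after sketch for Postgres (attribute names are assumptions and may differ from the final schema):

```hcl
# Before: table defined inline on the source via the deprecated `table` block.
resource "materialize_source_postgres" "example" {
  name = "pg_source"
  # connection, publication and other required attributes omitted

  table {
    upstream_name        = "orders"
    upstream_schema_name = "shop"
    name                 = "orders"
  }
}

# After: the table lives in its own resource, created FROM SOURCE.
resource "materialize_source_table_postgres" "orders" {
  name          = "orders"
  schema_name   = "public"
  database_name = "materialize"

  source {
    name = materialize_source_postgres.example.name
  }

  upstream_name        = "orders"
  upstream_schema_name = "shop"
}
```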

## Automated Migration Process (TBD)

nice - this is great! We will probably want to figure out how to tell them that they will be able to coordinate the 'automated' migration process with their field-engineer representative if they go down this path

@rjobanp commented Sep 17, 2024

@morsapaes @bobbyiliev let's discuss this PR at the sources & sinks meeting this week. We should decide when it makes sense to merge this; my thinking is we should do so whenever we move into private preview for the source versioning feature. But if we want to merge sooner and just add a disclaimer that the things marked as 'deprecated' here are not actually deprecated yet, that could work too.
