Enabling case sensitive table and column names #163

gpodevijn · 2019-10-24T17:44:02Z

I would like to understand the reason of forcing table and column names to be lower case.

In postgres.py, on lines 385 and 392, the regexes only accept lowercase identifiers (and the error messages are in line with that).

What would be potential consequences of allowing case sensitive identifier names? For example, changing the regexes by:

'^[a-z_A-Z].*' and '^[a-z0-9_$A-Z]+$'

Thank you.
Regards

The text was updated successfully, but these errors were encountered:

AlexanderMann · 2019-10-25T17:35:48Z

@gpodevijn good question!

So the main rationale is that it's easier to query identifier values in PostgreSQL which all match. The reason for this is that PostgreSQL by default ignores case when querying values. ie, the following queries match:

SELECT a, b, c
FROM FoO

SELECT A, B, c
FROM foo

In order to query by case sensitive values, we have to enclose the above in double quotes. It's a minor barrier to working with a produced schema, but alas.

The other reason for this is that this is what Stitch Data does. Per our docs here we try to match up with their resulting data store whenever possible to make use of their community as best as possible.

gpodevijn · 2019-10-28T09:11:17Z

@AlexanderMann thanks a lot for the explanations!

I have three follow-up questions.

When you're saying In order to query by case sensitive values, we have to enclose the above in double quotes -- is that correct that you're saying this from a user/data consumer perspective (i.e., there wouldn't be any implications for the target or any other tap)?

Matching up what Stitch Data does makes perfect sense to me. However, if I have a use case where I'd like case sensitive table and column names in my Postgres database, would it be safe for me to fork your repo and apply the appropriate changes (the regexes in my first post)?

Finally, do you think that it would make sense to add a parameter to your current implementation where we could choose between lower case (current implementation) or case sensitive identifier names (mentioning that it'd break compatibility with Stitch Data) ?

AlexanderMann · 2019-10-28T15:09:38Z

@gpodevijn I tried to answer everything as best I could. Please feel free to follow up. I tried to note our past decisions where I could, and then also what my interpretation of the various questions was, etc. etc.

1

When you're saying In order to query by case sensitive values, we have to enclose the above in double quotes -- is that correct that you're saying this from a user/data consumer perspective (i.e., there wouldn't be any implications for the target or any other tap)?

I think the answer here is "yes" but I'm a little confused by the question. What follows is my understanding of the question along with rationale presented in a slightly different way to see if any of it helps 😄

The intention behind the decision is to make querying for a data-analyst simpler. ie. if I as a data-scientist have to remember to "everything", it's a slight additional burden to being able to get access to insights etc.

Building applications on top of a Singer target is slippery in the first place, so it's often expected that dealing with further processing post-Singer will require some wrangling etc. ie, it's difficult to expect fields to come over 1-1.

2

would it be safe for me to fork your repo and apply the appropriate changes (the regexes in my first post)?

Most likely yes. But support for that will vary depending on the amount of changes etc. You should be fine, but etc. 🤷‍♀

Question for you!

I would be curious to hear from you what your use case here is? Typically name collisions in a schema are due to a number of conflicting things, not casing alone. I'm curious if you want to have tables named Foo and foo and those be used for distinct things, etc.

3

Finally, do you think that it would make sense to add a parameter to your current implementation where we could choose between lower case (current implementation) or case sensitive identifier names (mentioning that it'd break compatibility with Stitch Data) ?

Soooooooooo...no ❌ 😄

We discussed this at some point, but the problem you run into is that various folks are going to want various re-mapping etc.

For you, you want case sensitivity to be the only thing differing from "what Stitch does". For someone else, they'll want to ditch the _ underscores, and yet another will want no numerics in the names etc. etc.

Our hope by providing a single function to overwrite for name-canonicalization was that intrepid users (such as yourself) could easily fork the repo and get the additional functionality and push the support of that particular decision onto the developer instead of a potentially hairy feature flag in our codebase.

gpodevijn · 2019-10-28T16:37:19Z

@AlexanderMann this is a very detailed answer, thank you very much.

I understand your arguments and your decisions!

So my use case is this one: Our data scientists want all table names and column names to be snake_case. We have an ETL process that loads our database tables into Redshift. When columns are CamelCase, the ETL process renames the columns in snake_case based on a set of specific rules (e.g., MyColumn becomes my_column but ARR becomes arr and not a_r_r). The rules to rename a column from CamelCase to snake_case tries to be as generic as possible (e.g., ARR is not hard coded in the rules). This is actually a set of horrible regexes 🙂 Having all our column names in lower case in postgres prevents me from using the set of regex rules, as MyColumn would become mycolumn and it would be much more difficult to rename it to my_column. I hope what I'm saying makes sense to you 😄

In conclusion, I will fork the repo and implement my changes ⌨️ 🙂

AlexanderMann · 2019-10-29T14:07:57Z

This is actually a set of horrible regexes

Godspeed @gpodevijn ...we've all been there...

Out of curiousity, why are you using target-postgres and not target-redshift?

Additionally, would a feature such as #66 help you out? I'm assuming it'd be even more work to get your ETL-regex-monster to recognize column comments etc...

AlexanderMann closed this as completed Oct 29, 2019

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Enabling case sensitive table and column names #163

Enabling case sensitive table and column names #163

gpodevijn commented Oct 24, 2019 •

edited

Loading

AlexanderMann commented Oct 25, 2019

gpodevijn commented Oct 28, 2019 •

edited

Loading

AlexanderMann commented Oct 28, 2019

gpodevijn commented Oct 28, 2019

AlexanderMann commented Oct 29, 2019

Enabling case sensitive table and column names #163

Enabling case sensitive table and column names #163

Comments

gpodevijn commented Oct 24, 2019 • edited Loading

AlexanderMann commented Oct 25, 2019

gpodevijn commented Oct 28, 2019 • edited Loading

AlexanderMann commented Oct 28, 2019

1

2

Question for you!

3

gpodevijn commented Oct 28, 2019

AlexanderMann commented Oct 29, 2019

gpodevijn commented Oct 24, 2019 •

edited

Loading

gpodevijn commented Oct 28, 2019 •

edited

Loading