Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Enabling case sensitive table and column names #163

Closed
gpodevijn opened this issue Oct 24, 2019 · 5 comments
Closed

Enabling case sensitive table and column names #163

gpodevijn opened this issue Oct 24, 2019 · 5 comments

Comments

@gpodevijn
Copy link

gpodevijn commented Oct 24, 2019

I would like to understand the reason of forcing table and column names to be lower case.

In postgres.py, on lines 385 and 392, the regexes only accept lowercase identifiers (and the error messages are in line with that).

What would be potential consequences of allowing case sensitive identifier names? For example, changing the regexes by:

'^[a-z_A-Z].*' and '^[a-z0-9_$A-Z]+$'

Thank you.
Regards

@AlexanderMann
Copy link
Collaborator

@gpodevijn good question!

So the main rationale is that it's easier to query identifier values in PostgreSQL which all match. The reason for this is that PostgreSQL by default ignores case when querying values. ie, the following queries match:

SELECT a, b, c
FROM FoO
SELECT A, B, c
FROM foo

In order to query by case sensitive values, we have to enclose the above in double quotes. It's a minor barrier to working with a produced schema, but alas.

The other reason for this is that this is what Stitch Data does. Per our docs here we try to match up with their resulting data store whenever possible to make use of their community as best as possible.

@gpodevijn
Copy link
Author

gpodevijn commented Oct 28, 2019

@AlexanderMann thanks a lot for the explanations!

I have three follow-up questions.

When you're saying In order to query by case sensitive values, we have to enclose the above in double quotes -- is that correct that you're saying this from a user/data consumer perspective (i.e., there wouldn't be any implications for the target or any other tap)?

Matching up what Stitch Data does makes perfect sense to me. However, if I have a use case where I'd like case sensitive table and column names in my Postgres database, would it be safe for me to fork your repo and apply the appropriate changes (the regexes in my first post)?

Finally, do you think that it would make sense to add a parameter to your current implementation where we could choose between lower case (current implementation) or case sensitive identifier names (mentioning that it'd break compatibility with Stitch Data) ?

@AlexanderMann
Copy link
Collaborator

@gpodevijn I tried to answer everything as best I could. Please feel free to follow up. I tried to note our past decisions where I could, and then also what my interpretation of the various questions was, etc. etc.

1

When you're saying In order to query by case sensitive values, we have to enclose the above in double quotes -- is that correct that you're saying this from a user/data consumer perspective (i.e., there wouldn't be any implications for the target or any other tap)?

I think the answer here is "yes" but I'm a little confused by the question. What follows is my understanding of the question along with rationale presented in a slightly different way to see if any of it helps 😄

The intention behind the decision is to make querying for a data-analyst simpler. ie. if I as a data-scientist have to remember to "everything", it's a slight additional burden to being able to get access to insights etc.

Building applications on top of a Singer target is slippery in the first place, so it's often expected that dealing with further processing post-Singer will require some wrangling etc. ie, it's difficult to expect fields to come over 1-1.

2

would it be safe for me to fork your repo and apply the appropriate changes (the regexes in my first post)?

Most likely yes. But support for that will vary depending on the amount of changes etc. You should be fine, but etc. 🤷‍♀

Question for you!

I would be curious to hear from you what your use case here is? Typically name collisions in a schema are due to a number of conflicting things, not casing alone. I'm curious if you want to have tables named Foo and foo and those be used for distinct things, etc.

3

Finally, do you think that it would make sense to add a parameter to your current implementation where we could choose between lower case (current implementation) or case sensitive identifier names (mentioning that it'd break compatibility with Stitch Data) ?

Soooooooooo...no ❌ 😄

We discussed this at some point, but the problem you run into is that various folks are going to want various re-mapping etc.

For you, you want case sensitivity to be the only thing differing from "what Stitch does". For someone else, they'll want to ditch the _ underscores, and yet another will want no numerics in the names etc. etc.

Our hope by providing a single function to overwrite for name-canonicalization was that intrepid users (such as yourself) could easily fork the repo and get the additional functionality and push the support of that particular decision onto the developer instead of a potentially hairy feature flag in our codebase.

@gpodevijn
Copy link
Author

@AlexanderMann this is a very detailed answer, thank you very much.

I understand your arguments and your decisions!

So my use case is this one: Our data scientists want all table names and column names to be snake_case. We have an ETL process that loads our database tables into Redshift. When columns are CamelCase, the ETL process renames the columns in snake_case based on a set of specific rules (e.g., MyColumn becomes my_column but ARR becomes arr and not a_r_r). The rules to rename a column from CamelCase to snake_case tries to be as generic as possible (e.g., ARR is not hard coded in the rules). This is actually a set of horrible regexes 🙂 Having all our column names in lower case in postgres prevents me from using the set of regex rules, as MyColumn would become mycolumn and it would be much more difficult to rename it to my_column. I hope what I'm saying makes sense to you 😄

In conclusion, I will fork the repo and implement my changes ⌨️ 🙂

@AlexanderMann
Copy link
Collaborator

This is actually a set of horrible regexes

Godspeed @gpodevijn ...we've all been there...

Out of curiousity, why are you using target-postgres and not target-redshift?

Additionally, would a feature such as #66 help you out? I'm assuming it'd be even more work to get your ETL-regex-monster to recognize column comments etc...

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants