Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Unify base and view #2

Closed
wants to merge 61 commits into from

Conversation

agnessnowplow
Copy link
Collaborator

@agnessnowplow agnessnowplow commented Sep 12, 2023

I think it would be worth to take a quick look at the current progress and get some feedback before I continue to make changes on sessions and users as the main logic of unification happens in base and views already so any changes I would have to correct further up as well.

I have added a dummy one line mobile data in the integration tests so that we can see if it actually runs as expected. The integration tests are currently only running to see if they compile and only until derived.views. For Snowflake/Bigquery/Databricks I checked that the mobile line is returned with the extra context fields but it needs further investigation as I progress, although the original target is to only support Snowflake with the first release. For Postgres/Redshift the base macro needed a bit of tweaking to avoid duplication coming from using the same session context for both the user and session identifiers (update: it works now).

What I want to agree on mainly is the logic of unification and which contexts to use. Upon discussion with Matus it turned out that we have started moving towards removing the atomic hardcoded fields, which will be a gradual progress, some users can already use other contexts (e.g. browser context) to replace certain fields etc. I think if we try and make the fields more 'conceptual' then when this happens we can just add a coalesce and keep the integrity. However, to avoid having a 150+ wide tables, the context fields are planned to only be optional, based on what the user selects which is where it may get tricky to get the right balance of unification. I also updated the lists of fields to extract to a more recent schema version, I think this needs double checking still.

Yml files, docs, intro etc are incomplete as of now, please ignore them (as well as optional modules).

To do (things to consider later on in a future PR, depending on time before the release or after):

  • adding the android referrer context
  • adding an optional model for cross platform tracking
  • making sure optional modules: cwv/consent work
  • porting over mobile optional module: app_errors (updating if needed)
  • replace engagement calculation for mobile (once the tracking side is finished within this quarter)
  • add proper integration test (a simplified version at that)
  • consider removing parts from yml /dbt tests
  • check if we need to remove the exclude fields macro or move it to utils
  • how does it handle anonymous tracking?

Copy link
Contributor

@rlh1994 rlh1994 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is absolutely fantastic work, I've got a couple (21, lol) of comments which are mostly focused around setting up best possible for the future, or to decrease code duplication even further than you have.

Seriously, not having a model for every single warehouse is HUGE, and the 2-model approach to first standardising and then coalescing makes it really clean and clear where we manage the unification. Doing the field extraction and unification upfront also is so great to be able to pull a load of complexity and if blocks out from downstream, amazing!

dbt_project.yml Outdated Show resolved Hide resolved
dbt_project.yml Outdated Show resolved Hide resolved
macros/exclude_columns.sql Outdated Show resolved Hide resolved
Comment on lines 23 to 33
{% if var('snowplow__enable_mobile') %}
ev.session__previous_session_id,
{% endif %}

-- user id fields
ev.user_id,
ev.user_identifier,
ev.domain_userid,
{% if var('snowplow__enable_mobile') %}
ev.session__user_id as device_user_id,
{% endif %}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So I think it might be worth us having a discussion as a team about column ordering, as this could really help simplify the code and increase readability if we group all the web-only and mobile-only fields together. Probably have this sooner rather than later?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay I kept reading and saw you mostly already did this, but I still think it would be good to have a style guide around this and decide how strict on that we want to be

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, if nothing else, you can always group them in the CTE and then do a final order at the end...

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have simplified the views by blockifying fields, I think it made it less error prone and more readable but you tell me.

models/views/scratch/snowplow_unified_pv_engaged_time.sql Outdated Show resolved Hide resolved
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think there were discussions of tracking active time directly in mobile at some point @matus-tomlein? I guess at that point we could just add it into here if it exists, use the calculation otherwise?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, that's on our roadmap for this quarter to come up with a solution for engagement tracking on mobile. Sorry I don't fully understand the comment – is there any other calculation that we currently do on mobile? I can only see the one for Web based on page pings, just wondering if I'm missing something.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No that's fine, that's what I mean is that we don't currently have an approach for mobile, but once that's ready we could add that to this model and it would just slot in downstream

Comment on lines 215 to 217

row_number() over (partition by ev.view_id order by ev.derived_tstamp, ev.dvce_created_tstamp) as view_id_dedupe_index

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I want other peoples opinions on this, but given there's literally 3 lines of difference between postgres and every other warehouse on a ~700 line file, I wonder if we should just wrap those lines in an if target.type... instead and only have 1 file, which would make updates or anything else much easier to manage?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think there are many if blocks due to the various conditions so I don't think this would cause much difference when it comes to readability, if that is the concern. I have made the changes just so that we can actually see how it looks in real life and it needed 3 blocks of code like that.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My personal opinion is 3 blocks to remove duplicated 690 lines is a very worthwhile trade, but not sure what @emielver and @georgewoodhead think?

macros/unify_fields_query.sql Show resolved Hide resolved
macros/unify_fields_query.sql Outdated Show resolved Hide resolved
Comment on lines 507 to 511
-- device fields
pve.app_id,
pve.platform,
pve.device_category,
{% if var('snowplow__enable_mobile_context') %}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should bring these fields up more to the front, because now you are likely to want to filter web/mobile, where before they were (except maybe app_id) mostly the same...

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added them behind timestamp fields.

Copy link
Contributor

@matus-tomlein matus-tomlein left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is really exciting!

domain_userid,
domain_sessionidx,
network_userid,
geo_country,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for the nitpick, but would there be a way to distinguish the geo properties that are calculated based on the IP address (such as this one), from the ones based on the geolocation context (which are reported from the device)? They are quite different and there is a lot of confusion around that from users. Maybe we could use geo_ip_... and geo_dvce_... as the prefixes? This might be out of scope for this PR, so ignore me in that case.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is a very valid point and I'm not sure how it was or wasn't distinguished in the scratch table for redshift/posttgres as I now unified them, but later on you will see that there is a separate --geo fields and -- mobile only fields with geo__ prefix (double underscore always indicates it comes from a context) perhaps I should change __geo to __geo_dvce so that the distinction is even more clear?

ev.domain_sessionidx,
{% endif %}
{% if var('snowplow__enable_mobile') %}
ev.session__session_index,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we flip the order of this around? I'm not sure about domain_sessionidx existing for mobile but if it does we need to take session__session_index, and session__session_index definitely doesn't exist for web so no harm done in swapping them around

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It can exist on Web (it's an optional configuration now) and we plan to make it default in the v4 version of the tracker (probably). Still no harm done if the order is swapped because it has the same info as domain_sessionidx. But in the future we plan not to deprecate the domain_sessionidx and use the session context on Web too, so we would probably need to decouple the session context from the snowplow__enable_mobile flag (but that is a problem for the future).

emielver and others added 26 commits September 13, 2023 17:06
@agnessnowplow
Copy link
Collaborator Author

Thanks for all the comments so far! I am now actively working on sessions (discovering more bugs etc) but keeping this PR open for a little while so that you can see the replies but will close/merge this soon.

@emielver emielver deleted the feature/unify_views branch December 1, 2023 11:46
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants