
Issue/refactor filters #185

Merged 1 commit into release/1.0.0 on Aug 11, 2022

Conversation

@colmsnowplow (Collaborator) commented Aug 8, 2022:

Once I've had a sense check on this PR, I'll rebase and squash into a single commit - I've left it this way to keep the reasoning as auditable as I can.

Context: In reviewing the release/1.0.0 PR, I had some questions/concerns about how the filters worked. It then became apparent that the way we had factored the code made it very difficult to decipher the actual behaviour. This is all stuff that should've come out in the feature PR review, so that much is on me.

Since it was quite difficult to resolve those questions, and create unit tests to illustrate them etc., I went ahead and created this PR to refactor the code and make that process easier. On doing so I found a number of issues with the filters - some flaws in design and some other incorrect behaviours for which we had incorrect unit tests.

@istreeter - as mentioned on slack I've asked for review as a sense check for:
a) That this way of factoring the code is readable/sensible
b) That the various things I've directly called out in comments make sense, and are sensible decisions.

The below is more a description of the steps I took - feel free to peruse it too, but it's a lot of info and not necessarily relevant to the above two things, so feel free not to pay attention. I'm using it both to record what's in this PR for Robert's benefit, and as basically what I'm putting in the issue description for posterity's sake.

@TiganeteaRobert - I've tried to create this PR in a way that's hopefully useful to review so that you can do each of those things. Below is a description of what I did - I think it might be useful for you to take a look through, then we can discuss when we catch up tomorrow if you have any questions.

  • First, I refactored the code structure: 5dcaaa4

The main thing here is that we had two functions which handled multiple distinct pieces of logic (one for each type of filter) - I refactored to make those independent pieces of logic testable on their own.

I also changed the API for unstruct filters here to make the checks on event name and version explicit on configuration.

  • There's also some factoring out of compiling regexes: b2c0d68

This is for two reasons: (1) efficiency; (2) if compilation is going to fail, it should only happen at startup. I know we have a validation check on this, so it's not a major concern.

These are two of the bugs I found originally, which were hard to understand without refactoring. In sum: if we don't retrieve any value, the v1 PR behaviour is to always return false. However, if we're matching negatively, it should be true when we don't find the value.

Secondly, if we do retrieve a value but it's nil, then since we're using janky fmt.Sprint() to cajole it into a string, we actually match against the string <nil> - this is down to pre-existing jankery, for which I've created an issue: #186
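A quick demonstration of the jank - nothing project-specific here, just how Go's fmt package renders a nil interface value (the helper name `coerce` is illustrative):

```go
package main

import "fmt"

// coerce mimics the filter's string coercion: a nil interface value
// rendered with fmt.Sprint becomes the literal string "<nil>", which
// is why the refactor swaps nil for "" before matching.
func coerce(v interface{}) string {
	return fmt.Sprint(v)
}

func main() {
	fmt.Println(coerce(nil))   // "<nil>"
	fmt.Println(coerce("abc")) // "abc"
}
```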

  • Using Compile() instead of MustCompile(): 61dbba8

We don't ever want to panic, it's always better to handle the error.
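A sketch of the pattern this commit moves to - returning the error from regexp.Compile rather than letting MustCompile panic (the function name here is illustrative, not the PR's actual code):

```go
package main

import (
	"fmt"
	"regexp"
)

// compileFilterRegex returns an error instead of panicking on an
// invalid pattern, so a misconfiguration surfaces as a handled error
// at startup rather than crashing the process.
func compileFilterRegex(pattern string) (*regexp.Regexp, error) {
	re, err := regexp.Compile(pattern)
	if err != nil {
		return nil, fmt.Errorf("invalid filter regex %q: %w", pattern, err)
	}
	return re, nil
}

func main() {
	if _, err := compileFilterRegex("[unclosed"); err != nil {
		fmt.Println("config error:", err)
	}
}
```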

This is primarily down to my now thinking the design decision we (I really) landed on in the analytics SDK was wrong. We should fix it there. Issue related to that can be found here: snowplow/snowplow-golang-analytics-sdk#36

Please pay particular attention to the test changes. There were a number of tests passing that were configured to test for the opposite of the desired behaviour. It's important to step back and consider what the actual desired behaviour is for each test case.


Edit to add: Just for completeness of info, I've made issues for some future improvements we could make, which I didn't feel were worth delaying v1 release further. Happy to hear any dissenting opinions on them, or on their being pushed out to post-v1. :)

// if valuesFound is nil, we found no value.
// Because negative matches are a thing, we still want to match against an empty string
if valuesFound == nil {
	valuesFound = make([]interface{}, 1)
}
@colmsnowplow (Collaborator Author):
Previously, when there were no values, we never evaluated anything (just returned false)

This is important, as a negative lookahead should evaluate to true when we have no value (because "" != "someValue")

return nil, nil
for _, v := range valuesFound {
if v == nil {
v = "" // because nil gets cast to `<nil>`
@colmsnowplow (Collaborator Author) commented Aug 8, 2022:

This is necessary because we're using fmt.Sprint() to cast everything to string when we evaluate the filter.

Really, this is a workaround to cover up some jank - I've opened an issue to do better here: #186

Contributor:

It's my mistake here - I did not realise that nil does not work properly with Sprint. We should indeed find a better solution here. My vote goes towards handling the different possible types in the existing filter.

RegexTimeout int `hcl:"regex_timeout,optional"`
// For native filters
Description string `hcl:"description,optional"`
UnstructEventName string `hcl:"unstruct_event_name,optional"`
@colmsnowplow (Collaborator Author):

For unstruct filters (i.e. filters on data within custom events), we actually check the event name (because the same fields can be present in separate events), and optionally the version (if provided).

In this refactor, I realised that these should be explicit options - this forces the user into awareness that those checks are part of this filter.

Contributor:

I like it. Tbh, before this it was a bit confusing when it came to the path to the field we wanted to apply the filter on - this makes it much friendlier.

// createSpEnrichedFilterFunction returns a TransformationFunction which filters messages based on a field in the Snowplow enriched event
// and a regex declared by the user.
Before: func createSpEnrichedFilterFunction(field, regex string, regexTimeout int, isUnstructEvent bool) (TransformationFunction, error)
After:  func createSpEnrichedFilterFunction(regex string, regexTimeout int, getFunc valueGetter) (TransformationFunction, error)
@colmsnowplow (Collaborator Author):

Just to call out, as it may make it easier to make sense of the changes:

This create function (and the make functions for getters) runs once, on startup. The filter function itself runs once per batch of events (which at the moment is usually a batch of 1 event - so effectively once per event).

The idea of this refactor is:

  • We only compile regexes once
  • We keep the code DRY
  • We outsource the part which gets the value to a separate function, which is independently unit testable (the getter)
  • The nuance of unstruct filters doing more work is also shifted to the getter function - which is appropriate to the design IMO: if the event name or version don't match, we should behave as if the desired field is simply nil.
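The shape of that refactor can be sketched as below - a simplified, illustrative stand-in (plain maps instead of parsed Snowplow events, and invented names like createFilter), not the PR's actual code:

```go
package main

import (
	"fmt"
	"regexp"
)

// valueGetter stands in for the getter abstraction: each filter
// flavour supplies its own extraction function, and the shared
// create function compiles the regex once and closes over the getter.
type valueGetter func(event map[string]interface{}) []interface{}

func createFilter(pattern string, get valueGetter) (func(map[string]interface{}) bool, error) {
	re, err := regexp.Compile(pattern) // compiled once, at startup
	if err != nil {
		return nil, err
	}
	// The returned closure is what runs per event.
	return func(event map[string]interface{}) bool {
		for _, v := range get(event) {
			if v == nil {
				v = "" // avoid matching the literal "<nil>"
			}
			if re.MatchString(fmt.Sprint(v)) {
				return true
			}
		}
		return false
	}, nil
}

func main() {
	getAppID := func(e map[string]interface{}) []interface{} { return []interface{}{e["app_id"]} }
	keep, _ := createFilter("^myApp$", getAppID)
	fmt.Println(keep(map[string]interface{}{"app_id": "myApp"})) // true
}
```

The getter is trivially unit testable on its own, and the matching logic is written exactly once regardless of how many filter flavours exist.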

Contributor:

This is brilliant, I love the new structure both from a readability and from a technical view.


valueFound, err := parsedMessage.GetUnstructEventValue(convertPathToInterfaces(path)...)
// We don't return an error for empty field since this just means the value is nil.
if err != nil && err.Error() != analytics.EmptyFieldErr && !strings.Contains(err.Error(), "not found") {
@colmsnowplow (Collaborator Author):

Since fields can be optional, I've changed this to return nil when we can't find the field. Actually this is a problem with the API design in the analytics SDK so there's an issue to fix that also.

@istreeter left a comment:

I very much like the approach, and your comments in the PR were helpful.

Feel free to ignore my comment about edge cases -- honestly it's fine. But for reference, in enrich we use a 3rd party jsonpath library to handle this kind of thing, e.g. when parsing configs for api enrichment, sql enrichment, pii enrichment.

func NewSpEnrichedFilterFunctionUnstructEvent(eventNameToMatch, eventVersionToMatch, pathToField, regex string, regexTimeout int) (TransformationFunction, error) {
// This regex retrieves the path fields
// (e.g. field1.field2[0].field3 -> [field1, field2, 0, field3])
regexWords := `\w+`

@istreeter:

Just for fun....

I can think of edge cases that will not work with this implementation. You might tell me these edge cases are crazy and we should ignore them, and tbf you would be correct.

Edge case 1

schema:

"properties": {
  "foo-bar": {
     "type": "string"
  }
}

valid event:

{
  "foo-bar": "baz"
}

And imagine pathToField is foo-bar. This would break because \w+ would create an array of two items ["foo", "bar"].

Edge case 2

"properties": {
  "123": {
    "type": "string"
  }
}

valid event:

{
  "123": "foo"
}

And imagine pathToField is 123. This would break because snowplow_enriched_util would parse it as an integer and then the analytics sdk would treat it as an array index.

@colmsnowplow (Collaborator Author) commented Aug 9, 2022:

No I don't think they're crazy, I simply didn't think of them.

The analytics SDK's API is this way (expecting a list of interface arguments) because it uses json-iterator under the hood, which expects a list of interfaces. We use it because it allows for retrieving values without having to unmarshal the JSON - and so is vastly more efficient than doing so.

At a cursory glance I don't think there's a jsonpath library that doesn't require unmarshalling. However there may be a solution which is better than what we have (albeit likely less reliable than using a full jsonpath library). I'll write some tests and then see what I can come up with.

@colmsnowplow (Collaborator Author) commented Aug 9, 2022:

I've created an issue on the analytics SDK repo to explore a different JSON parser that looks like it may offer a simpler API (essentially jsonpath dot notation) for this task, without necessarily impeding performance: snowplow/snowplow-golang-analytics-sdk#38

I don't think blocking this release on that is the right call, however, so I'm still exploring whether I can provide an improved solution to what we have here also.

@colmsnowplow (Collaborator Author):

OK I have now come up with something different - but I don't know if it's better. What I've done is:

  • Make explicit config option for the full context name, rather than including that in dot notation
  • Replaced convertPathToInterfaces with parsePathToArguments (the rename is unnecessary now that I look at how it shakes out - it wasn't at first)

parsePathToArguments splits on . first, then we determine if it should be an integer via the length of a split on [.

It's kinda jank. I think the preferable solution long term is to change the analytics sdk API.

Also, it still has edge cases - we couldn't have [ in the field name (probably not valid anyway), and we can't have nested arrays (field[0][1] won't work). These seem less likely to me than the others.

Still, I'm not 100% about this - if we think it's too jank I'm happy to revert back to what we had. WDYT @istreeter ?

@colmsnowplow colmsnowplow mentioned this pull request Aug 9, 2022
return createSpEnrichedFilterFunction(regex, regexTimeout, getUnstructValuesForMatch)
}

func parsePathToArguments(pathToField string) ([]interface{}, error) {

@istreeter:

Is there a neater implementation using something like:

re := regexp.MustCompile(`\w+|\[\d+\]`)
parts := re.FindAllString(pathToField, -1)

convertedPath := make([]interface{}, 0)
for _, s := range parts {
	// if it looks like `[42]` then extract the integer and treat it as an array index
	// otherwise treat it as an object key
	convertedPath = append(convertedPath, ???)
}

@istreeter:

That regex in my previous suggestion was flawed. This one might be better:

re := regexp.MustCompile(`\[\d+\]|[^\.\[]+`)

I'm not 100% sure this method will end up being neater than what you did. But I think it's worth following through with the effort, to see what the end result looks like and then to decide which is neater.

@colmsnowplow (Collaborator Author):

not sure why I didn't play with a different regex first, just pivoted to something else... 🤔 let me give it a test and see where I get to.

@colmsnowplow (Collaborator Author) commented Aug 10, 2022:

@istreeter after experimenting with it, this way is definitely better, covers more cases, and leaves us with cleaner code.

The one edge case still present was when the string provided contains an unclosed opening bracket (like the case in our last unit test: test1.test[2.test3). Because the first regex strips the opening bracket, it treats that as just a "2" - which means it wouldn't fail, and since a non-existent path just means 'match against nil', the misconfiguration would go unnoticed. It's an unlikely edge case, but there's some downside.

I tried a different way of chopping it up to handle this explicitly, but it left us with less readable code and was quite awkward. Thinking about this case, though, I don't think it's likely that we'd have an unmatched [ in the path without it being an error. On that assumption, I just added a check that count([) == count(]) in the string before proceeding (and if they don't match, we throw an error on startup).
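That startup check can be sketched with strings.Count (the function name is illustrative):

```go
package main

import (
	"fmt"
	"strings"
)

// validateBrackets mirrors the check described above: an unequal count
// of "[" and "]" almost certainly means a misconfigured path, so we
// fail fast at startup instead of silently mis-filtering data.
func validateBrackets(path string) error {
	if strings.Count(path, "[") != strings.Count(path, "]") {
		return fmt.Errorf("invalid path to field: %s", path)
	}
	return nil
}

func main() {
	fmt.Println(validateBrackets("test1.test[2.test3"))  // error: unclosed bracket
	fmt.Println(validateBrackets("test1.test[2].test3")) // nil
}
```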

@istreeter:

I don't think we need to worry about such weird edge cases. If a user really has such weird field names, then they just have to accept that they cannot use stream replicator to perform transformations. That's OK, I think they'll get over it. Or they will re-design their schemas!

(What's important to me, is that nothing else breaks if a schema has weird fields. I mean, it would be bad if someone could generate an unexpected event that makes the replicator crash; but that's not what we're discussing here)

In other words, I think what you've done is good. It's a nice balance between trying to support all schemas, but not going overboard in unnecessarily supporting weird edge cases which don't deserve our support!

@istreeter:

Hang on..... does this regexp work, and also fix the edge case you were worried about?

re := regexp.MustCompile(`\[\d+\]|[^\.]+`)

@colmsnowplow (Collaborator Author):

It leaves us with parts like test1[1], which then need to be handled again.

That's not really a problem, but I prefer what we have in the PR because I think it's actually preferable to throw the error when there's a miscount of brackets - it's far more likely that this is misconfiguration than not. Without the check, this misconfiguration would lead to the app deploying successfully and incorrectly filtering the data - which is a headache.

So I think it's better to keep the check and throw the error - in which case, I don't think there's a benefit to changing the regex.

@istreeter:

👍

@colmsnowplow colmsnowplow merged commit 2aaa079 into release/1.0.0 Aug 11, 2022
@colmsnowplow colmsnowplow deleted the issue/refactor-filters branch August 11, 2022 16:40