Raccoon needs to add ingestion time to every event #11

Open
chakravarthyvp opened this issue May 28, 2021 · 2 comments
Labels
enhancement New feature or request

Comments

@chakravarthyvp
Contributor

chakravarthyvp commented May 28, 2021

Problem

We (GoJek) currently use Raccoon to source clickstream events from the Gojek app. The concrete product proto contains an event_timestamp field that downstream systems such as the DWH can use to partition the data. However, some data lands in partitions for future dates, while other data with the same event timestamp date arrives on different days. Two scenarios cause this:

  1. The user resets the clock on the mobile device to a future date
  2. The app was inactive, and the mobile SDK sent those events at a later point in time

Is there any workaround?
The DWH can partition on a field that records the time of ingestion into the warehouse. However, this requires backfilling and repartitioning existing data, and the upstream applications may need to change the way they query.

What is the impact?
Queries from upstream applications and services return erroneous results.

Which version was this found?
NA

Solution
Raccoon needs to provide an ingestion time for each event, defined as the time the event was ingested into Raccoon. This lets the DWH partition data by ingestion time as an alternative to event_timestamp.

@chakravarthyvp chakravarthyvp added the enhancement New feature or request label May 28, 2021
@burntcarrot

I'm assuming you mean that the ingestion time sits between the event time and the processing time. If so, how would you want the ingestion time to be integrated into Raccoon? Are we supposed to add the field to the product proto itself, or do something else?

Apologies for the newbie question. I don't have any professional experience working with real-time data; I'm just a student. 😓

@chakravarthyvp
Contributor Author

@burntcarrot - That is a very valid question. The ingestion time should be part of the product proto, which the client serialises as bytes in Event.proto. Since Raccoon is event-agnostic, injecting a timestamp into the product proto would mean that Raccoon has to deserialise the product proto first, which breaks that architectural principle. What Raccoon needs is a generic metadata proto that the product protos compose. Raccoon can then deserialise using this generic proto and inject the timestamp.
Do you have other suggestions?
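
One possible variant of the composed-metadata idea, sketched in Go, avoids decoding the payload at all and appends the field to the serialised bytes instead. It assumes that every product proto embeds a shared metadata message under the same reserved field number. It relies on the protobuf wire-format rule that a parser merges repeated occurrences of an embedded message field. The field numbers 100 and 1, and the function `AppendIngestionTime`, are hypothetical and do not exist in Raccoon.

```go
package main

import "encoding/binary"

// Hypothetical field numbers; a real schema would have to reserve them
// in every product proto and in the shared metadata message.
const (
	metaFieldNumber          = 100 // product proto: Metadata meta = 100;
	ingestionTimeFieldNumber = 1   // Metadata: int64 ingestion_time_ms = 1;
)

// Protobuf wire types used in this sketch.
const (
	wireVarint = 0 // int64 and other varint scalars
	wireBytes  = 2 // length-delimited: embedded messages, strings, bytes
)

// appendTag writes a field key, which is (field number << 3) | wire type,
// as a base-128 varint. Go's Uvarint encoding matches protobuf's varint.
func appendTag(b []byte, field, wireType uint64) []byte {
	return binary.AppendUvarint(b, field<<3|wireType)
}

// AppendIngestionTime returns a copy of payload followed by an encoded
// metadata field that carries unixMillis. The payload itself is not parsed.
func AppendIngestionTime(payload []byte, unixMillis int64) []byte {
	// Encode the embedded Metadata message: one varint field.
	meta := appendTag(nil, ingestionTimeFieldNumber, wireVarint)
	meta = binary.AppendUvarint(meta, uint64(unixMillis))

	// Append tag, length, and body of the embedded message to the payload.
	out := make([]byte, 0, len(payload)+len(meta)+4)
	out = append(out, payload...)
	out = appendTag(out, metaFieldNumber, wireBytes)
	out = binary.AppendUvarint(out, uint64(len(meta)))
	return append(out, meta...)
}
```

A consumer that decodes the result with the full product schema would see the metadata populated, while Raccoon stays agnostic of the product fields. The trade-off is that the reserved field number becomes a contract that every product proto has to honour.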

Projects
Status: Future
Development

No branches or pull requests

2 participants