Anish-Malhotra/mailchimp-technical


Welcome to my attempt at the Data Engineer II take-home project.

Here I've created a CLI script that asynchronously loads a JSON file in chunks to bulk index the relevant documents.

The approach is generic, so it can be extended to other document types or data sources (API stream, polling, etc.).
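As a rough illustration of the chunk-then-bulk-index idea described above, here is a minimal synchronous sketch using only the standard library. It is not the script in this repository: the helper names (`chunked`, `to_bulk_body`, `index_in_chunks`), the `send` callable, and the chunk size of 500 are all hypothetical. The only thing taken from Elasticsearch itself is the newline-delimited layout of a `_bulk` request body (an action line followed by the document source).

```python
import json
from itertools import islice
from typing import Callable, Iterable, Iterator, List


def chunked(items: Iterable[dict], size: int) -> Iterator[List[dict]]:
    """Group any iterable of documents into lists of at most `size` items."""
    iterator = iter(items)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        yield chunk


def to_bulk_body(index: str, docs: List[dict]) -> str:
    """Build a newline-delimited _bulk body: one action line, then one source line, per document."""
    lines = []
    for doc in docs:
        lines.append(json.dumps({"index": {"_index": index}}))
        lines.append(json.dumps(doc))
    # The _bulk endpoint requires the body to end with a newline.
    return "\n".join(lines) + "\n"


def index_in_chunks(
    docs: Iterable[dict],
    index: str,
    send: Callable[[str], None],
    chunk_size: int = 500,  # hypothetical default, not taken from the project
) -> int:
    """Send documents chunk by chunk through `send` and return how many were sent."""
    total = 0
    for chunk in chunked(docs, chunk_size):
        send(to_bulk_body(index, chunk))
        total += len(chunk)
    return total
```

Because `docs` is any iterable and `send` is any callable, the same loop could in principle be fed from a file reader, an API stream, or a poller, and `send` could wrap whichever client performs the actual request.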

Resolving the yellow status for the 'github-events' index was the most challenging part of this assignment for me.

Some investigation turned up the following:

  • GET /_nodes -> shows the cluster has a total of 6 nodes
  • GET /_cluster/settings?pretty -> shows no allocation policy is set up (auto-rebalance, etc.)
  • GET /github-events/_settings -> shows the index allows only 1 shard per node
  • GET /_cluster/allocation/explain -> shows that all of the unassigned shards are replicas, and the reported reason is that the replicas would be stored on the same nodes as their primary shards

I increased the index's limit to 2 shards per node and enabled rebalancing for all indices, and that seemed to solve the issue.
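For reference, the two changes might look like the requests below. This is a sketch under assumptions, not a record of the exact calls that were made: it assumes the per-node limit is the index setting `index.routing.allocation.total_shards_per_node`, that rebalancing was enabled through the cluster setting `cluster.routing.rebalance.enable`, that the change was applied as a `persistent` setting, and that the cluster is reachable at the placeholder URL `http://localhost:9200`. The function only builds the requests and does not send them.

```python
import json
from urllib.request import Request


def build_settings_requests(
    base_url: str = "http://localhost:9200",  # placeholder address (assumption)
    index: str = "github-events",
) -> list:
    """Build (but do not send) the two PUT requests for the settings changes."""
    headers = {"Content-Type": "application/json"}

    # Assumed index-level setting: allow up to 2 shards of this index per node.
    index_body = {"index.routing.allocation.total_shards_per_node": 2}

    # Assumed cluster-level setting: allow rebalancing for all shard types.
    # Using "persistent" rather than "transient" is also an assumption.
    cluster_body = {"persistent": {"cluster.routing.rebalance.enable": "all"}}

    return [
        Request(
            f"{base_url}/{index}/_settings",
            data=json.dumps(index_body).encode("utf-8"),
            headers=headers,
            method="PUT",
        ),
        Request(
            f"{base_url}/_cluster/settings",
            data=json.dumps(cluster_body).encode("utf-8"),
            headers=headers,
            method="PUT",
        ),
    ]
```

Each request could then be passed to `urllib.request.urlopen`, after which `GET /_cluster/allocation/explain` can be re-run to check whether any replicas remain unassigned.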

Resources used:

Potential improvements:

  • Run the parse/transform of the input data in a separate thread
  • Integrate with the Async versions of the Elasticsearch client
  • Improve error messaging/verbosity
  • Add unit tests
