Add the Trino Snowflake connector story post.

trinodb · May 3, 2024 · c18b1b8
1 parent a434fff
commit c18b1b8
Showing 1 changed file with 294 additions and 0 deletions.
diff --git a/_posts/2024-05-03-trino-snowflake-bloomberg-oss-win.md b/_posts/2024-05-03-trino-snowflake-bloomberg-oss-win.md
@@ -0,0 +1,294 @@
+---
+layout: post
+title:  "Integrating Trino and Snowflake: An open source success story"
+author: Brian Olsen
+excerpt_separator: <!--more-->
+canonical_url: https://bitsondata.dev/trino-snowflake-bloomberg-oss-win
+---
+
+We’re seeing open source usher in a challenge to the economic model where the 
+success metric is increasing the commonwealth of economic capital. This 
+acceleration comes from playing positive-sum games with friends online and 
+avoiding limiting a community to a vision that only benefits a small number of 
+corporations or individuals. It’s hard to imagine how to embed such frameworks 
+within our current zero-sum winner-takes-all economic system. There’s certainly 
+no shortage of heated debates around how to construct a harmonious relationship
+between the open source community and the companies participating in it. What
+we don’t talk about enough are the positive examples, when a coordinated
+effort in open source sticks the landing and so many benefit from it.
+
+This post highlights the extraordinary contributions of 
+[Erik Anderson](https://www.linkedin.com/in/erikanderson/), 
+[Teng Yu](https://www.linkedin.com/in/tyu-fr/), 
+[Yuya Ebihara](https://www.linkedin.com/in/ebyhr/), and the broader 
+[Trino community](https://github.com/trinodb/trino) to finally contribute the 
+long-coveted 
+[Trino Snowflake Connector](https://trino.io/docs/current/connector/snowflake.html).
+
+It is both a success story and some wisdom on how corporations that use and
+provide services around open source should be involved in it. It also offers a
+blueprint of the patterns to follow for those who want to contribute to open
+source.
+
+## A common challenge in open source
+
+Despite my love for marketing and educating a community on tech (aka 
+[edutainment](https://en.wikipedia.org/wiki/Educational_entertainment)), it’s
+only the first part of the equation of what I aim to do as a developer advocate.
+Once developers see some exciting video or tutorial, they ultimately land on
+the docs site, GitHub, StackOverflow, or some communication platform in the
+community. It's at this point that developers can easily lose motivation if
+the docs lack proper getting-started materials or, worse, the community is
+silent. This is the part I categorize as developer experience (aka
+devex), which aims to improve both the user and contributor experiences in the
+developer community by
+[empowering decision making by doing](https://en.wikipedia.org/wiki/Experiential_learning),
+[removing inefficiencies](https://trino.io/blog/2023/01/09/cleaning-up-the-trino-backlog),
+and as we'll cover here, exposing untapped opportunities.
+
+As in any open source project, maintainers on the Trino project struggle to
+communicate our lack of proper resources to build and test new features built
+on various proprietary software. For those less familiar, Trino is a federated
+query engine with
+[multiple data sources](https://trino.io/docs/current/connector.html). Trino
+tests integrations with open data sources by running small local instances of
+the connecting system. Snowflake is a proprietary cloud-native data warehouse,
+also known as a cloud data platform, so we had no way to test this integration,
+which was
+[eagerly](https://github.com/trinodb/trino/pull/2551#issuecomment-873082280)
+[sought](https://github.com/trinodb/trino/issues/1863)
+[by many](https://github.com/trinodb/trino/issues/7247). After an
+[initial attempt](https://github.com/trinodb/trino/pull/2551) by my friend
+[Phillipe Gagnon](https://www.linkedin.com/in/pfgagnon), a similar pattern
+emerged [with the second pull request](https://github.com/trinodb/trino/pull/10387):
+development velocity started strong and stagnated after some months.
+
+### Cognitive surplus and communication deficit
+
+One of the most unfortunate recurring issues I see in the open source projects
+I've contributed to and helped maintain is that the larger objectives well
+known among the core group often move fast, while individual updates that
+don't fit into a larger project narrative have a higher likelihood of getting
+lost in the shuffle. As an open source project grows, you end up with a
+cognitive surplus in the form of an abundance of bright people willing to share
+their time, intellect, and experience with a larger community. 
+
+Often both contributors and maintainers are so busy with their day jobs,
+families, and self care, that they dedicate most of their remaining energy to
+ensuring they write quality code and tests to the best of their ability. Lack of
+upfront communication to validate ideas from newer contributors, and lack of
+communication by maintainers who see a large number of issues to address are
+two communication issues that stagnate a project. Maintainers are often doers
+that see more value in addressing quick-win work that flows from the
+well-established contributors of the project. Follow-through on either side can
+be difficult as newcomers don't want to be rude and maintainers accidentally
+forget or hope someone else will take the time to address the issues on that
+pull request. 
+
+![](https://i.snap.as/kbwKpWa7.webp)
+
+Waiting for your work to be reviewed by someone in the community works a bit
+like a wishing well: you toss in a coin (i.e. your time and effort represented
+as code and a pull request) and hope your wish of getting your code reviewed
+and merged comes true. The satisfaction of the hypothetical developers who
+benefit from your small but significant change floods your mind and you feel
+like you’ve improved humanity just that one little bit more.
+
+Maintainers are in a constant state of triaging the surplus of innovation
+being thrown at them while simultaneously looking for more help with reviews
+and acting as the experts in some areas of the code. As you can imagine,
+good communication can be hard to come by, as many newcomers are strangers and
+worry they are wasting precious time by asking too many questions rather
+than just showing a proof of concept. This backfires for multiple reasons that
+go beyond the scope of this post.
+
+### History repeats itself, until it doesn't
+
+It became apparent that each time there was
+[a discussion](https://github.com/trinodb/trino/pull/2551#issuecomment-709220790)
+about how to do
+[integration testing](https://github.com/trinodb/trino/pull/10387#issuecomment-1008430060),
+there was no good way to test against a Snowflake instance given the project's
+lack of funding. Trino has a high bar for quality, and none of the maintainers
+felt that merging an untested connector was a risk worth taking, given the
+likely popularity of the integration and the likelihood of future maintenance
+issues. Each pull request met this same fate and stalled, with nobody really
+knowing how to resolve the real issue of limited financial resources of the
+[Trino Software Foundation (TSF)](https://trino.io/foundation.html). It’s never
+fun to mention that you can’t move forward on work with constraints like these
+and yet nobody has the time to look into a monetary solution.
+
+Noticing that Teng had already done a significant amount of work to contribute
+his Snowflake connector, I reached out to him to see if we could brainstorm a
+solution. Not long after, Erik Anderson also reached out to get my thoughts on
+how to go about contributing Bloomberg's Snowflake connector. Great, now we had
+two connector implementations and no solution for getting the infrastructure to
+test them. During the first
+[Trino Contributor Congregation](https://trino.io/blog/2022/11/21/trino-summit-2022-recap.html#trino-contributor-congregation),
+Erik and I brought up Bloomberg's desire to contribute a Snowflake connector and
+I articulated the testing issue. Ironically, this was the first time I had
+thoroughly articulated the issue to Erik as well.
+
+As soon as I was done, Erik requested the mic and said something to the effect of,
+"Oh I wish I would have known that's the problem, the solution is simple,
+Bloomberg will provide the TSF a Snowflake account."
+
+Done!
+
+Just as in business, **you should never underestimate the power of communication
+in an open source project**. Shortly after, Erik, Teng, and I discussed the
+best ways to merge their work, set up the Snowflake accounts for Trino
+maintainers, and started the arduous process of building a thorough test suite
+with the help of Yuya, Piotr, Manfred, and Martin.
+
+## The long road to Snowflake
+
+As Teng and Erik merged their efforts, the process was anything but
+straightforward. There were setbacks, vacations, meticulous reviews, and
+infrastructure issues. But the perseverance of everyone involved was unwavering.
+
+Bloomberg started by creating
+[an official Bloomberg Trino repository](https://github.com/bloomberg/trino)
+originally as a means for Teng and Erik to mesh their solutions together and
+build the testing infrastructure that relied on Bloomberg resources. Without
+needing to rely on the main Trino project to merge incremental solutions, they
+were able to iterate quickly on the early solutions. This repository also
+facilitated Bloomberg’s now numerous contributions to Trino.
+
+It took a few months just to get the 
+ForePaaS<a name="fn1"></a><sup><a class="footnote" href="#fnref1">1</a></sup>
+and Bloomberg solutions merged. There were valuable takes from each system and
+better integration tests were written with the new testing infrastructure. The
+two Snowflake connector implementations were merged together by April of 2023.
+Finally, the reviews could start. Once the initial two passes happened, we
+anticipated that the Snowflake connector would be released in the summer of
+2023, near Trino Fest. We were confident enough that we planned
+[a talk with Erik and Teng](https://trino.io/blog/2023/07/12/trino-fest-2023-let-it-snow-recap)
+initially as a reveal assuming the pull request would be merged by then. Lo and
+behold, this didn’t happen, as there were still a lot of concerns around use
+cases not being properly tested.
+
+### The Halting Review Problem
+
+A necessary evil that comes with pull request reviews and, more broadly,
+distributed consensus is that reviews can drag on over time. This can lead to a
+[countless number of updates](https://github.com/trinodb/trino/pull/17909#issuecomment-1841809727)
+you have to make to your changes to accommodate the ever-changing project
+shifting beneath your feet as you simultaneously try to make progress on
+[suggestions from those reviewing your code](https://github.com/trinodb/trino/pull/17909#pullrequestreview-1793724311).
+
+Many critics of open source like to point this out as a drawback of the model.
+In fact, the same problem exists in closed-source systems, but closed-source
+projects can generally delay difficult decisions to make fast upfront
+progress to meet certain deadlines. This may be seen as an advantage at first,
+but as many developers can attest, this simply leads to technical debt and
+fragile products in most environments that struggle to prioritize a healthy
+codebase.
+
+Regardless, having to face these larger discussions upfront can induce fatigue,
+especially when managing external circumstances. Personal affairs, a project at
+work (you know, the entity that pays these engineers), or countless other
+factors will rear their ugly heads and
+[progress will stagger](https://github.com/trinodb/trino/pull/17909#discussion_r1418149737)
+with ebbs and flows of attention. This can be really dangerous territory and
+commonly results in contributors and reviewers abandoning the PR when it stalls.
+
+This is why I believe open source, while not beholden to any timelines, needs a
+sort of project/product management role, which is currently often covered by
+project leaders and DevEx engineers. This can also relieve tension between the
+needs of open source and big businesses in the community with real deadlines, at
+least keeping the communication consistent while ensuring bugs and design flaws
+aren’t introduced to the code base.
+
+## What’s in it for Bloomberg and ForePaaS?
+
+If you’ve never worked in open source or for a company that contributes to open
+source, you may be wondering how the heck these engineers convince their
+leadership to let them dump so much time into these contributions. The simple
+answer is, it’s good for business.
+
+If we peep into why Bloomberg uses Trino, we see that they aggregate data from
+an unusually large number of data sources across the customers who use their
+services. Part of this requires them to merge the customer’s dataset with
+existing aggregate data in Bloomberg’s product. Since Trino can connect to most
+customer databases out-of-the-box, Bloomberg only needs to manage a small array
+of custom connectors that provide their services to customers as multiple
+catalogs in a single conventional SQL endpoint. Having engineers maintain a few
+small connectors rather than an entire distributed query engine themselves
+saves a lot of time and maintenance.
+
+Despite how many problems Trino already solves for them, Bloomberg ideally still
+wants to maintain as few things as possible to free up their engineers’ time.
+Luckily, open source projects are generally more than happy to accept features
+that the whole community benefits from. This doesn’t mean we shouldn’t
+appreciate when companies contribute. This generosity and forward-thinking
+approach enabled Erik and Teng to combine their battle-tested connectors,
+creating something of high value for the community.
+
+Yet another issue that has come up much more recently is the
+[XZ exploit](https://www.darkreading.com/application-security/xz-utils-scare-exposes-hard-truths-in-software-security),
+where attackers disguised as well-meaning open source contributors slowly put
+a backdoor in place to enable hacking any company using this software.
+Thankfully,
+[backdoors only work when they are unnoticeable](https://opensourcesecurity.io/2019/08/28/backdoors-in-open-source-are-here-to-stay/)
+and, due to the widespread testing of open source, a Microsoft employee noticed
+a performance issue that led them to find the slowly crafted exploit. Some see
+this as a reason to avoid open source. On the contrary, you’ll simply
+never know how many exploits live in closed-source technology, and the number of
+eyeballs able to find those exploits is far smaller. In this spirit, using and
+contributing to open source becomes the safer alternative, especially as we
+learn how to improve detection of exploits on public open source systems. Just
+as it will one day seem odd that we ever relied on gas-powered vehicles, it will
+also seem odd that we ever trusted software whose dependencies we weren’t able
+to inspect.
+
+<iframe width="560" height="315" src="https://www.youtube.com/embed/4NAU-UyiJ8A" frameborder="0" allow="accelerometer; autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>
+
+This takeaway is important for developers who want to contribute to open source:
+it is imperative that developers convey a compelling story with evidence when
+getting involved in open source. I don’t imagine a world of altruists, but I do
+imagine an economic incentive model that helps us help each other much more than
+in the past.
+
+## Esprit de Corps
+
+The marines use the mantra, “Esprit de Corps,” French for “spirit of the people”,
+where I mistakenly took the “Corps” part to mean the Marine Corps rather than the
+more general meaning of a body or group of people. In fact, it expresses [the 
+common spirit existing in the members of a group and inspiring enthusiasm,
+devotion, and strong regard for the honor of the 
+group.](https://www.merriam-webster.com/dictionary/esprit%20de%20corps) Any time
+I see this type of shared and selfless cooperation in open source, I’m reminded
+of the bond, friendships, and care shared among me and my fellow marines in
+times of war. Despite the unfortunate political circumstances of our mission, I
+do treasure the shared companionship with both my fellow marines and the local
+Iraqi people. There is ultimately a power in the gathering of many when it is
+aimed at building an altruistic means of improving each other’s lives.
+
+In the same way, this demonstration of human cooperation is about more than just
+developing a connector; it's about the shared experiences, the friendships
+forged, and the skills honed in the pursuit of a common goal. The successful
+addition of the Trino Snowflake connector is a testament to the positive-sum
+outcomes of open source collaboration. This journey has been about
+collaboration, learning, and growth that will benefit many.
+
+Another effect is that once the upfront hard work like this is done, it makes
+way for many valuable iterations like [adding Top-N
+support](https://github.com/trinodb/trino/pull/21219) (Shopee), [adding
+Snowflake Iceberg REST catalog support](https://github.com/trinodb/trino/pull/21365)
+(Starburst), and [adding better type mapping
+support](https://github.com/trinodb/trino/pull/21365) (Apple) for the Snowflake
+integration. I love showcasing this trailblazing and, yes, altruistic work by
+Erik, Teng, Yuya, Martin, Manfred, Piotr, and everyone who helped in the Trino
+community. A special thanks to the managers and leadership at Bloomberg and
+ForePaaS for their generous commitment of time and resources.
+
+As we celebrate this milestone, we're already looking forward to the next
+adventure. Here's to federating them all, together!
+
+================================================================================
+Notes:
+<a name="fnref1"></a><sup><a class="footnote-ref" href="#fn1">1</a></sup>
+<span class="footnote-ref-text">ForePaaS has been integrated into OVHCloud, where it has been renamed Data Platform.</span>
+
+_bits_