Add the Trino Snowflake connector story post.
1 parent
a434fff
commit c18b1b8
Showing 1 changed file with 294 additions and 0 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,294 @@ | ||
--- | ||
layout: post | ||
title: "Integrating Trino and Snowflake: An open source success story" | ||
author: Brian Olsen | ||
excerpt_separator: <!--more--> | ||
canonical_url: https://bitsondata.dev/trino-snowflake-bloomberg-oss-win | ||
--- | ||
|
||
We’re seeing open source usher in a challenge to the economic model where the | ||
success metric is increasing the commonwealth of economic capital. This | ||
acceleration comes from playing positive-sum games with friends online and | ||
avoiding limiting a community to a vision that only benefits a small number of | ||
corporations or individuals. It’s hard to imagine how to embed such frameworks | ||
within our current zero-sum winner-takes-all economic system. There’s certainly | ||
no shortage of heated debates around how to construct a harmonious relationship | ||
between the open source community and the companies participating in it. What | ||
we don’t talk about enough are the positive examples, when a coordinated | ||
effort in open source sticks the landing and so many benefit from it. | ||
|
||
This post highlights the extraordinary contributions of | ||
[Erik Anderson](https://www.linkedin.com/in/erikanderson/), | ||
[Teng Yu](https://www.linkedin.com/in/tyu-fr/), | ||
[Yuya Ebihara](https://www.linkedin.com/in/ebyhr/), and the broader | ||
[Trino community](https://github.com/trinodb/trino) to finally contribute the | ||
long-coveted | ||
[Trino Snowflake Connector](https://trino.io/docs/current/connector/snowflake.html). | ||
|
||
It is both a success story and a source of some wisdom on how corporations that use and | ||
provide services around open source should be involved in open source. It also | ||
provides a blueprint of the patterns to follow for those who want to | ||
contribute to open source. | ||
|
||
## A common challenge in open source | ||
|
||
Despite my love for marketing and educating a community on tech (aka | ||
[edutainment](https://en.wikipedia.org/wiki/Educational_entertainment)), it’s | ||
only the first part of the equation of what I aim to do as a developer advocate. | ||
Once developers see some exciting video or tutorial, they ultimately land on | ||
the docs site, GitHub, StackOverflow, or some communication platform in the | ||
community. It's at this point that developers can easily lose motivation if | ||
the docs lack proper getting-started materials or, worse, the community is | ||
silent. This is what I categorize as the developer experience (aka | ||
devex), which aims to improve both the user and contributor experiences in the | ||
developer community by | ||
[empowering decision making by doing](https://en.wikipedia.org/wiki/Experiential_learning), | ||
[removing inefficiencies](https://trino.io/blog/2023/01/09/cleaning-up-the-trino-backlog), | ||
and as we'll cover here, exposing untapped opportunities. | ||
|
||
As in any open source project, maintainers on the Trino project struggle to | ||
communicate our lack of proper resources to build and test new features built | ||
on various proprietary software. For those less familiar, Trino is a federated | ||
query engine with | ||
[multiple data sources](https://trino.io/docs/current/connector.html). Trino | ||
tests integrations with open data sources by running small local instances of | ||
the connecting system. Snowflake is a proprietary cloud-native data warehouse, | ||
also known as a cloud data platform, so the project had no way to test this integration, | ||
which was | ||
[eagerly](https://github.com/trinodb/trino/pull/2551#issuecomment-873082280) | ||
[sought](https://github.com/trinodb/trino/issues/1863) | ||
[by many](https://github.com/trinodb/trino/issues/7247). After an | ||
[initial attempt](https://github.com/trinodb/trino/pull/2551) by my friend | ||
[Phillipe Gagnon](https://www.linkedin.com/in/pfgagnon) stalled, the same pattern | ||
emerged [with the second pull request](https://github.com/trinodb/trino/pull/10387): | ||
development velocity started strong and stagnated after some months. | ||
|
||
### Cognitive surplus and communication deficit | ||
|
||
One of the most unfortunate repeated issues I see in the open source projects | ||
I've contributed to and helped maintain is that the larger | ||
objectives well known among the core group often move fast, while individual | ||
updates that don't fit into a larger project narrative are more likely to | ||
get lost in the shuffle. As an open source project grows, you end up with a | ||
cognitive surplus in the form of an abundance of bright people willing to share | ||
their time, intellect, and experience with a larger community. | ||
|
||
Often both contributors and maintainers are so busy with their day jobs, | ||
families, and self care, that they dedicate most of their remaining energy to | ||
ensuring they write quality code and tests to the best of their ability. Lack of | ||
upfront communication to validate ideas from newer contributors, and lack of | ||
communication by maintainers who see a large number of issues to address are | ||
two communication issues that stagnate a project. Maintainers are often doers | ||
who see more value in addressing quick-win work that flows from the | ||
well-established contributors of the project. Follow-through on either side can | ||
be difficult, as newcomers don't want to be rude and maintainers accidentally | ||
forget or hope someone else will take the time to address the issues on that | ||
pull request. | ||
|
||
![](https://i.snap.as/kbwKpWa7.webp) | ||
|
||
Waiting for your work to be reviewed by someone in the community works a bit | ||
like a wishing well: you toss in a coin (i.e. your time and effort represented | ||
as code and a pull request) and hope your wish of getting your code reviewed | ||
and merged comes true. The imagined satisfaction of the hypothetical developers who | ||
benefit from your small yet significant change floods your mind, and you feel | ||
like you’ve improved humanity just that one little bit more. | ||
|
||
Maintainers are in a constant state of triage on the surplus of | ||
innovation being thrown at them, while simultaneously looking for more help | ||
with reviewing and with being the expert in some areas of the code. As you can imagine, | ||
good communication can be hard to come by as many newcomers are strangers and | ||
concerned they are wasting precious time by asking too many questions rather | ||
than just showing a proof of concept. This backfires for multiple reasons that | ||
go outside of the scope of this post. | ||
|
||
### History repeats itself, until it doesn't | ||
|
||
It became apparent that each time there was | ||
[a discussion](https://github.com/trinodb/trino/pull/2551#issuecomment-709220790) | ||
about how to do | ||
[integration testing](https://github.com/trinodb/trino/pull/10387#issuecomment-1008430060), | ||
there was no good way to test against a Snowflake instance given the lack of funding for | ||
the project. Trino has a high bar for quality, and none of the maintainers felt | ||
that merging an untested connector was a risk worth taking, given the likely popularity of the integration and | ||
the likelihood of future maintenance issues. Once each pull request met this same | ||
fate, it stalled with nobody really knowing how to resolve the real issue of | ||
limited financial resources of the | ||
[Trino Software Foundation (TSF)](https://trino.io/foundation.html). It’s never | ||
fun to mention that you can’t move forward on work with constraints like these | ||
and yet nobody has the time to look into a monetary solution. | ||
|
||
Noticing that Teng had already done a significant amount of work to contribute | ||
his Snowflake connector, I reached out to him to see if we could brainstorm a | ||
solution. Not long after, Erik Anderson also reached out to get my thoughts on | ||
how to go about contributing Bloomberg's Snowflake connector. Great, now we have | ||
two connector implementations and no solution to getting the infrastructure to | ||
get them tested. During the first | ||
[Trino Contributor Congregation](https://trino.io/blog/2022/11/21/trino-summit-2022-recap.html#trino-contributor-congregation), | ||
Erik and I brought up Bloomberg's desire to contribute a Snowflake connector and | ||
I articulated the testing issue. Ironically, this was also the first time I had | ||
actually articulated the issue thoroughly to Erik. | ||
|
||
As soon as I was done, Erik requested the mic and said something to the effect of, | ||
"Oh I wish I would have known that's the problem, the solution is simple, | ||
Bloomberg will provide the TSF a Snowflake account." | ||
|
||
Done! | ||
|
||
Just as in business, **you should never underestimate the power of communication in | ||
an open source project**. Shortly after, Erik, Teng, and I discussed the | ||
best ways to merge their work, set up the Snowflake accounts for Trino | ||
maintainers, and started the arduous process of building a thorough test suite | ||
with the help of Yuya, Piotr, Manfred, and Martin. | ||
|
||
## The long road to Snowflake | ||
|
||
As Teng and Erik merged their efforts, the process was anything but | ||
straightforward. There were setbacks, vacations, meticulous reviews, and | ||
infrastructure issues. But the perseverance of everyone involved was unwavering. | ||
|
||
Bloomberg started by creating | ||
[an official Bloomberg Trino repository](https://github.com/bloomberg/trino) | ||
originally as a means for Teng and Erik to mesh their solutions together and | ||
build the testing infrastructure that relied on Bloomberg resources. Without | ||
needing to rely on the main Trino project to merge incremental solutions, they | ||
were able to iterate quickly on the early solutions. This repository also | ||
facilitated Bloomberg’s now numerous contributions to Trino. | ||
|
||
It took a few months just to get the | ||
ForePaaS<a name="fn1"></a><sup><a class="footnote" href="#fnref1">1</a></sup> | ||
and Bloomberg solutions merged. There were valuable takes from each system and | ||
better integration tests were written with the new testing infrastructure. The | ||
two Snowflake connector implementations were merged together by April of 2023. | ||
Finally, the reviews could start. Once the initial two passes happened, we | ||
anticipated that the Snowflake connector would be released in the summer of | ||
2023 near Trino Fest. We were so confident that we planned | ||
[a talk with Erik and Teng](https://trino.io/blog/2023/07/12/trino-fest-2023-let-it-snow-recap) | ||
initially as a reveal, assuming the pull request would be merged by then. Lo and | ||
behold, this didn’t happen, as there were still a lot of concerns around use | ||
cases not being properly tested. | ||
|
||
### The Halting Review Problem | ||
|
||
A necessary evil that comes with pull request reviews and, more broadly, | ||
distributed consensus is that reviews can drag on over time. This can lead to a | ||
[countless number of updates](https://github.com/trinodb/trino/pull/17909#issuecomment-1841809727) | ||
you have to make to your changes to accommodate the ever-changing project | ||
shifting beneath your feet as you simultaneously try to make progress on | ||
[suggestions from those reviewing your code](https://github.com/trinodb/trino/pull/17909#pullrequestreview-1793724311). | ||
|
||
Many critics of open source like to point this out as a drawback of the model. | ||
In fact, this same problem exists in closed-source systems, but closed-source | ||
projects can generally delay difficult decisions to make fast upfront | ||
progress to meet certain deadlines. This may be seen as an advantage at first, | ||
but as many developers can attest, this simply leads to technical debt and | ||
fragile products in most environments that struggle to prioritize a healthy | ||
codebase. | ||
|
||
Regardless, having to face these larger discussions upfront can induce fatigue, | ||
especially when managing external circumstances. Personal affairs, a project at | ||
work (you know, the entity that pays these engineers), or countless other | ||
factors will rear their ugly heads, and | ||
[progress will stagger](https://github.com/trinodb/trino/pull/17909#discussion_r1418149737) | ||
with ebbs and flows of attention. This can be really dangerous territory and | ||
commonly results in contributors and reviewers abandoning the PR when it stalls. | ||
|
||
This is why I believe open source, while not beholden to any timelines, needs a | ||
sort of project/product management role, which is currently often covered by | ||
project leaders and DevEx engineers. This can also relieve tension between the | ||
needs of open source and big businesses in the community with real deadlines, at | ||
least keeping the communication consistent while ensuring bugs and design flaws | ||
aren’t introduced to the code base. | ||
|
||
## What’s in it for Bloomberg and ForePaaS? | ||
|
||
If you’ve never worked in open source or for a company that contributes to open | ||
source, you may be wondering how the heck these engineers convince their | ||
leadership to let them dump so much time into these contributions. The simple | ||
answer is, it’s good for business. | ||
|
||
To see why Bloomberg uses Trino, consider that they aggregate data from an unusually | ||
large number of data sources across the customers who use their services. Part | ||
of this requires them to merge the customer’s dataset with existing aggregate | ||
data in Bloomberg’s product. Since Trino can connect to most customer databases | ||
out-of-the-box, Bloomberg only needs to manage a small array of custom | ||
connectors that provide their services to customers as multiple catalogs in a | ||
single conventional SQL endpoint. Having engineers maintain a few small connectors | ||
rather than an entire distributed query engine themselves saves a lot of time | ||
and maintenance. | ||
|
||
Despite how many problems Trino already solves for them, Bloomberg ideally still | ||
wants to maintain as few things as possible to free up their engineers’ time. | ||
Luckily, open source projects are generally more than happy to accept features | ||
that the whole community benefits from. This doesn’t mean we shouldn’t appreciate | ||
when companies contribute. This generosity and forward-thinking approach enabled | ||
Erik and Teng to combine their battle-tested connectors, creating something of | ||
high value for the community. | ||
|
||
Yet another issue that has come up much more recently is the | ||
[XZ exploit](https://www.darkreading.com/application-security/xz-utils-scare-exposes-hard-truths-in-software-security), | ||
where attackers disguised as well-meaning open source contributors slowly put | ||
a backdoor in place to enable hacking any company using this software. | ||
Thankfully, | ||
[backdoors only work when they are unnoticeable](https://opensourcesecurity.io/2019/08/28/backdoors-in-open-source-are-here-to-stay/) | ||
and, due to the widespread testing of open source, a Microsoft employee noticed | ||
a performance issue that led them to find the slowly crafted exploit. Some see | ||
this as a reason to avoid open source. On the contrary, you’ll simply | ||
never know how many exploits live in closed-source technology, and the number of | ||
eyeballs able to find those exploits is far smaller. In this spirit, using and | ||
contributing to open source becomes the safer alternative, especially as we | ||
learn how to improve detection of exploits on public open source systems. Just | ||
as it will one day seem odd that we ever relied on gas-powered vehicles, it will | ||
also seem odd that we ever trusted software whose dependencies we weren’t able | ||
to inspect. | ||
|
||
<iframe width="560" height="315" src="https://www.youtube.com/embed/4NAU-UyiJ8A" frameborder="0" allow="accelerometer; autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe> | ||
|
||
This takeaway is important for developers who want to contribute to open source: | ||
it is imperative that developers convey a compelling story with evidence when | ||
getting involved in open source. I don’t imagine a world of altruists, but I do | ||
imagine an economic incentive model that helps us help each other much more than | ||
in the past. | ||
|
||
## Esprit de Corps | ||
|
||
The marines use the mantra “Esprit de Corps,” French for “spirit of the body”. | ||
I mistakenly took “Corps” to mean the Marine Corps rather than the | ||
more general meaning of a body or group of people. In fact, it expresses [the | ||
common spirit existing in the members of a group and inspiring enthusiasm, | ||
devotion, and strong regard for the honor of the | ||
group.](https://www.merriam-webster.com/dictionary/esprit%20de%20corps) Any time | ||
I see this type of shared and selfless cooperation in open source, I’m reminded | ||
of the bond, friendships, and care between me and my fellow marines in times of war. | ||
Despite the unfortunate political circumstances of our mission, I do treasure | ||
the shared companionship with both my fellow marines and the local Iraqi people. | ||
There is ultimately a power in the gathering of many when aimed at building an | ||
altruistic means of improving each other’s lives. | ||
|
||
In the same way, this demonstration of human cooperation is about more than just | ||
developing a connector; it's about the shared experiences, the friendships | ||
forged, and the skills honed in the pursuit of a common goal. The successful | ||
addition of the Trino Snowflake connector is a testament to the positive sum | ||
outcomes of open source collaboration. This journey has been about | ||
collaboration, learning, and growth that will benefit many. | ||
|
||
Another effect is that once upfront hard work like this is done, it makes | ||
way for many valuable iterations like [adding Top-N | ||
support](https://github.com/trinodb/trino/pull/21219) (Shoppee), [adding | ||
Snowflake Iceberg REST catalog support](https://github.com/trinodb/trino/pull/21365) | ||
(Starburst), and [adding better type mapping | ||
support](https://github.com/trinodb/trino/pull/21365) (Apple) for the Snowflake | ||
integration. I love showcasing this trailblazing and, yes, altruistic work by Erik, | ||
Teng, Yuya, Martin, Manfred, and Piotr, and by everyone who helped in the Trino | ||
community. A special thanks to the managers and leadership at Bloomberg and | ||
ForePaaS for their generous commitment of time and resources. | ||
|
||
As we celebrate this milestone, we're already looking forward to the next | ||
adventure. Here's to federating them all, together! | ||
|
||
================================================================================ | ||
Notes: | ||
<a name="fnref1"></a><sup><a class="footnote-ref" href="#fn1">1</a></sup> | ||
<span class="footnote-ref-text">ForePaaS has been integrated into OVHCloud and renamed Data Platform.</span> | ||
|
||
_bits_ |