Add the Trino Snowflake connector story post.
bitsondatadev committed May 3, 2024
1 parent a434fff commit c18b1b8
Showing 1 changed file with 294 additions and 0 deletions.
294 changes: 294 additions & 0 deletions _posts/2024-05-03-trino-snowflake-bloomberg-oss-win.md
@@ -0,0 +1,294 @@
---
layout: post
title: "Integrating Trino and Snowflake: An open source success story"
author: Brian Olsen
excerpt_separator: <!--more-->
canonical_url: https://bitsondata.dev/trino-snowflake-bloomberg-oss-win
---

We’re seeing open source usher in a challenge to the economic model where the
success metric is increasing the commonwealth of economic capital. This
acceleration comes from playing positive-sum games with friends online and
avoiding limiting a community to a vision that only benefits a small number of
corporations or individuals. It’s hard to imagine how to embed such frameworks
within our current zero-sum winner-takes-all economic system. There’s certainly
no shortage of heated debates around how to construct a harmonious relationship
between the open source community and the companies participating in it. What
we don’t talk about enough are the positive examples, when a coordinated
effort in open source sticks the landing and so many benefit from it.

This post highlights the extraordinary contributions of
[Erik Anderson](https://www.linkedin.com/in/erikanderson/),
[Teng Yu](https://www.linkedin.com/in/tyu-fr/),
[Yuya Ebihara](https://www.linkedin.com/in/ebyhr/), and the broader
[Trino community](https://github.com/trinodb/trino) to finally contribute the
long-coveted
[Trino Snowflake Connector](https://trino.io/docs/current/connector/snowflake.html).

It is both a success story and a lesson in how corporations that use and provide
services around open source should get involved in open source. It also offers a
blueprint of patterns to follow for those who want to contribute to open
source.

## A common challenge in open source

Despite my love for marketing and educating a community on tech (aka
[edutainment](https://en.wikipedia.org/wiki/Educational_entertainment)), it’s
only the first part of the equation of what I aim to do as a developer advocate.
Once developers see some exciting video or tutorial, they ultimately land on
the docs site, GitHub, StackOverflow, or some communication platform in the
community. It's at this point that developers can easily lose motivation if
the community and docs lack proper getting-started materials or, worse, the
community is silent. This is what I categorize as developer experience (aka
devex), which aims to improve both the user and contributor experiences in the
developer community by
[empowering decision making by doing](https://en.wikipedia.org/wiki/Experiential_learning),
[removing inefficiencies](https://trino.io/blog/2023/01/09/cleaning-up-the-trino-backlog),
and as we'll cover here, exposing untapped opportunities.

As in many open source projects, maintainers on the Trino project struggle to
communicate our lack of resources to build and test new features that depend
on proprietary software. For those less familiar, Trino is a federated
query engine with
[multiple data sources](https://trino.io/docs/current/connector.html). Trino
tests integrations with open data sources by running small local instances of
the connecting system. Snowflake is a proprietary cloud-native data warehouse,
also known as a cloud data platform, with no local instance to run, so we had
no way to test this integration, which was
[eagerly](https://github.com/trinodb/trino/pull/2551#issuecomment-873082280)
[sought](https://github.com/trinodb/trino/issues/1863)
[by many](https://github.com/trinodb/trino/issues/7247). After an
[initial attempt](https://github.com/trinodb/trino/pull/2551) by my friend
[Phillipe Gagnon](https://www.linkedin.com/in/pfgagnon), a similar pattern
emerged [with the second pull request](https://github.com/trinodb/trino/pull/10387)
where the development velocity started strong and after some months stagnated.

### Cognitive surplus and communication deficit

One of the most unfortunate recurring issues I see in the open source projects
I've contributed to and helped maintain is that the well-known larger
objectives shared among the core group move fast, while individual
updates that don't fit into a larger project narrative are more likely to
get lost in the shuffle. As an open source project grows, you end up with a
cognitive surplus in the form of an abundance of bright people willing to share
their time, intellect, and experience with a larger community.

Often both contributors and maintainers are so busy with their day jobs,
families, and self care, that they dedicate most of their remaining energy to
ensuring they write quality code and tests to the best of their ability. Lack of
upfront communication to validate ideas from newer contributors, and lack of
communication by maintainers who see a large number of issues to address are
two communication issues that stagnate a project. Maintainers are often doers
that see more value in addressing quick-win work that flows from the
well-established contributors of the project. Follow-through on either side can
be difficult as newcomers don't want to be rude and maintainers accidentally
forget or hope someone else will take the time to address the issues on that
pull request.

![](https://i.snap.as/kbwKpWa7.webp)

Waiting for your work to be reviewed by someone in the community works a bit
like a wishing well: you toss in a coin (i.e. your time and effort represented
as code and a pull request) and hope your wish of getting your code reviewed
and merged comes true. The satisfaction of the hypothetical developers who
benefit from your small but significant change floods your mind and you feel
like you’ve improved humanity just that one little bit more.

Maintainers are in a constant state of triaging the surplus of
innovation being thrown at them while simultaneously looking for more help
with reviews and with expertise in some areas of the code. As you can imagine,
good communication can be hard to come by, as many newcomers are strangers and
are concerned they are wasting precious time by asking too many questions rather
than just showing a proof of concept. This backfires for multiple reasons that
are outside the scope of this post.

### History repeats itself, until it doesn't

It became apparent that each time there was
[a discussion](https://github.com/trinodb/trino/pull/2551#issuecomment-709220790)
about how to do
[integration testing](https://github.com/trinodb/trino/pull/10387#issuecomment-1008430060),
there was no good way to test against a Snowflake instance given the lack of
funding for the project. Trino has a high bar for quality, and none of the
maintainers felt that merging an untested connector was a risk worth taking,
given the likely popularity of the integration and the
likelihood of future maintenance issues. Once each pull request hit this same
fate, it stalled with nobody really knowing how to resolve the real issue of
limited financial resources of the
[Trino Software Foundation (TSF)](https://trino.io/foundation.html). It’s never
fun to mention that you can’t move forward on work with constraints like these
and yet nobody has the time to look into a monetary solution.

Noticing that Teng had already done a significant amount of work to contribute
his Snowflake connector, I reached out to him to see if we could brainstorm a
solution. Not long after, Erik Anderson also reached out to get my thoughts on
how to go about contributing Bloomberg's Snowflake connector. Great, now we have
two connector implementations and no solution to getting the infrastructure to
get them tested. During the first
[Trino Contributor Congregation](https://trino.io/blog/2022/11/21/trino-summit-2022-recap.html#trino-contributor-congregation),
Erik and I brought up Bloomberg's desire to contribute a Snowflake connector and
I articulated the testing issue. Ironically, this was the first time I had
thoroughly articulated the issue to Erik as well.

As soon as I was done, Erik requested the mic and said something to the effect of,
"Oh I wish I would have known that's the problem, the solution is simple,
Bloomberg will provide the TSF a Snowflake account."

Done!

Just as in business, **you should never underestimate the power of communication
in an open source project**. Shortly after, Erik, Teng, and I discussed the
best ways to merge their work, set up the Snowflake accounts for Trino
maintainers, and started the arduous process of building a thorough test suite
with the help of Yuya, Piotr, Manfred, and Martin.

## The long road to Snowflake

As Teng and Erik merged their efforts, the process was anything but
straightforward. There were setbacks, vacations, meticulous reviews, and
infrastructure issues. But the perseverance of everyone involved was unwavering.

Bloomberg started by creating
[an official Bloomberg Trino repository](https://github.com/bloomberg/trino)
originally as a means for Teng and Erik to mesh their solutions together and
build the testing infrastructure that relied on Bloomberg resources. Without
needing to rely on the main Trino project to merge incremental solutions, they
were able to iterate quickly on the early solutions. This repository also
facilitated Bloomberg’s now numerous contributions to Trino.

It took a few months just to get the
ForePaaS<a name="fn1"></a><sup><a class="footnote" href="#fnref1">1</a></sup>
and Bloomberg solutions merged. There were valuable takes from each system and
better integration tests were written with the new testing infrastructure. The
two Snowflake connector implementations were merged together by April of 2023.
Finally, the reviews could start. Once the initial two passes happened we
anticipated that we would see the Snowflake connector release in the summer of
2023 near Trino Fest. So much so, that we planned
[a talk with Erik and Teng](https://trino.io/blog/2023/07/12/trino-fest-2023-let-it-snow-recap)
initially as a reveal assuming the pull request would be merged by then. Lo and
behold, this didn’t happen, as there were still a lot of concerns around use
cases not being properly tested.

### The Halting Review Problem

A necessary evil that comes with pull request reviews and, more broadly,
distributed consensus is that reviews can drag on over time. This can lead to a
[countless number of updates](https://github.com/trinodb/trino/pull/17909#issuecomment-1841809727)
you have to make to your changes to accommodate the ever-changing project
shifting beneath your feet as you simultaneously try to make progress on
[suggestions from those reviewing your code](https://github.com/trinodb/trino/pull/17909#pullrequestreview-1793724311).

Many critics of open source like to point this out as a drawback of this model,
when in fact, this same problem exists in closed-source systems, but closed
source projects can generally delay difficult decisions to make fast upfront
progress to meet certain deadlines. This may be seen as an advantage at first,
but as many developers can attest, this simply leads to technical debt and
fragile products in most environments that struggle to prioritize a healthy
codebase.

Regardless, having to face these larger discussions upfront can induce fatigue,
especially when managing external circumstances. Personal affairs, a project at
work (you know, the entity that pays these engineers), or countless other
factors will rear their ugly heads and
[progress will stagger](https://github.com/trinodb/trino/pull/17909#discussion_r1418149737)
with ebbs and flows of attention. This is dangerous territory and
commonly results in contributors and reviewers abandoning the PR when it stalls.

This is why I believe open source, while not beholden to any timelines, needs a
sort of project/product management role, which is currently often covered by
project leaders and DevEx engineers. This can also relieve tension between the
needs of open source and big businesses in the community with real deadlines, at
least keeping the communication consistent while ensuring bugs and design flaws
aren’t introduced to the code base.

## What’s in it for Bloomberg and ForePaaS?

If you’ve never worked in open source or for a company that contributes to open
source, you may be wondering how the heck these engineers convince their
leadership to let them dump so much time into these contributions. The simple
answer is that it’s good for business.

If we peep into why Bloomberg uses Trino, they aggregate data from an unusually
large number of data sources across the customers who use their services. Part
of this requires them to merge the customer’s dataset with existing aggregate
data in Bloomberg’s product. Since Trino can connect to most customer databases
out-of-the-box, Bloomberg only needs to manage a small array of custom
connectors that provide their services to customers as multiple catalogs behind
a single conventional SQL endpoint. Having engineers maintain a few small
connectors rather than an entire distributed query engine themselves saves a lot
of time and maintenance.
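
To make the idea of multiple catalogs behind one SQL endpoint a bit more
concrete, here is a minimal sketch. It is not Trino code and not Bloomberg's
setup: it uses SQLite's `ATTACH DATABASE` from Python's standard library as a
stand-in for catalogs, and the `customer` and `vendor` names, the tables, and
the values are all made up for illustration. In Trino, the equivalent query
would address tables as `catalog.schema.table`, with each catalog backed by a
connector.

```python
import sqlite3


def federated_join():
    """Join tables that live in two separate databases over one connection.

    SQLite's ATTACH is only an analogy for Trino catalogs: each attached
    database plays the role of one data source behind a single SQL endpoint.
    All names and values below are hypothetical.
    """
    conn = sqlite3.connect(":memory:")

    # Two independent in-memory databases, standing in for two catalogs.
    conn.execute("ATTACH DATABASE ':memory:' AS customer")
    conn.execute("ATTACH DATABASE ':memory:' AS vendor")

    conn.execute("CREATE TABLE customer.holdings (ticker TEXT, shares INTEGER)")
    conn.execute("CREATE TABLE vendor.prices (ticker TEXT, price REAL)")

    conn.executemany(
        "INSERT INTO customer.holdings VALUES (?, ?)",
        [("AAA", 10), ("BBB", 5)],
    )
    conn.executemany(
        "INSERT INTO vendor.prices VALUES (?, ?)",
        [("AAA", 2.0), ("BBB", 4.0)],
    )

    # One query spans both sources, much like a cross-catalog join.
    rows = conn.execute(
        "SELECT h.ticker, h.shares * p.price AS value "
        "FROM customer.holdings AS h "
        "JOIN vendor.prices AS p ON h.ticker = p.ticker "
        "ORDER BY h.ticker"
    ).fetchall()
    conn.close()
    return rows


if __name__ == "__main__":
    for ticker, value in federated_join():
        print(ticker, value)
```

The point of the analogy is the shape of the query: the caller writes one SQL
statement and never has to know that the two tables live in different systems.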

Despite how many problems Trino already solves for them, Bloomberg ideally still
wants to maintain as few things as possible to free up their engineers’ time.
Luckily, open source projects are generally more than happy to accept features
that the whole community benefits from. This doesn’t mean we shouldn’t appreciate
when companies contribute. This generosity and forward-thinking approach enabled
Erik and Teng to combine their battle-tested connectors, creating something of
high value for the community.

Yet another issue that has come up much more recently is the
[XZ exploit](https://www.darkreading.com/application-security/xz-utils-scare-exposes-hard-truths-in-software-security),
where attackers disguised as well-meaning open source contributors slowly put
a backdoor in place to enable hacking any company using this software.
Thankfully,
[backdoors only work when they are unnoticeable](https://opensourcesecurity.io/2019/08/28/backdoors-in-open-source-are-here-to-stay/),
and due to the widespread testing of open source, a Microsoft employee noticed
a performance issue that led them to find the slowly crafted exploit. Some see
this as a reason to avoid open source. On the contrary, you’ll simply
never know how many exploits live in closed-source technology, and the number of
eyeballs able to find those exploits is far smaller. In this spirit, using and
contributing to open source becomes the safer alternative, especially as we
learn how to improve detection of exploits on public open source systems. Just
as it will one day seem odd that we ever relied on gas-powered vehicles, it will
also seem odd that we ever trusted software whose dependencies we weren’t able
to inspect.

<iframe width="560" height="315" src="https://www.youtube.com/embed/4NAU-UyiJ8A" frameborder="0" allow="accelerometer; autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>

This takeaway is important for developers who want to contribute to open source:
it is imperative that developers convey a compelling story with evidence when
getting involved in open source. I don’t imagine a world of altruists, but I do
imagine an economic incentive model that helps us help each other much more than
in the past.

## Esprit de Corps

The marines use the mantra “Esprit de Corps,” French for “spirit of the body.”
I mistakenly took the “Corps” part to mean the Marine Corps rather than the
more general meaning of a body or group of people. In fact, it expresses [the
common spirit existing in the members of a group and inspiring enthusiasm,
devotion, and strong regard for the honor of the
group.](https://www.merriam-webster.com/dictionary/esprit%20de%20corps) Any time
I see this type of shared and selfless cooperation in open source, I’m reminded
of the bond, friendships, and care between me and my fellow marines in times of
war. Despite the unfortunate political circumstances of our mission, I do treasure
the shared companionship with both my fellow marines and the local Iraqi people.
There is ultimately a power in the gathering of many when aimed at building an
altruistic means of improving each other’s lives.

In the same way, this demonstration of human cooperation is about more than just
developing a connector; it's about the shared experiences, the friendships
forged, and the skills honed in the pursuit of a common goal. The successful
addition of the Trino Snowflake connector is a testament to the positive-sum
outcomes of open source collaboration. This journey has been about
collaboration, learning, and growth that will benefit many.

Another effect is that once the upfront hard work like this is done, it makes
way for many valuable iterations like [adding Top-N
support](https://github.com/trinodb/trino/pull/21219) (Shopee), [adding
Snowflake Iceberg REST catalog support](https://github.com/trinodb/trino/pull/21365)
(Starburst), and [adding better type mapping
support](https://github.com/trinodb/trino/pull/21365) (Apple) for the Snowflake
integration. I love showcasing this trailblazing and, yes, altruistic work by
Erik, Teng, Yuya, Martin, Manfred, and Piotr, and by everyone who helped in the
Trino community. A special thanks to the managers and leadership at Bloomberg and
ForePaaS for their generous commitment of time and resources.

As we celebrate this milestone, we're already looking forward to the next
adventure. Here's to federating them all, together!

================================================================================
Notes:
<a name="fnref1"></a><sup><a class="footnote-ref" href="#fn1">1</a></sup>
<span class="footnote-ref-text">ForePaaS has been integrated into OVHCloud and renamed Data Platform.</span>

_bits_
