diff --git a/_posts/2024-05-03-trino-snowflake-bloomberg-oss-win.md b/_posts/2024-05-03-trino-snowflake-bloomberg-oss-win.md new file mode 100644 index 0000000000..b998092079 --- /dev/null +++ b/_posts/2024-05-03-trino-snowflake-bloomberg-oss-win.md @@ -0,0 +1,294 @@ +--- +layout: post +title: "Integrating Trino and Snowflake: An open source success story" +author: Brian Olsen +excerpt_separator: +canonical_url: https://bitsondata.dev/trino-snowflake-bloomberg-oss-win +--- + +We’re seeing open source usher in a challenge to the economic model where the +success metric is increasing the commonwealth of economic capital. This +acceleration comes from playing positive-sum games with friends online and +avoiding limiting a community to a vision that only benefits a small number of +corporations or individuals. It’s hard to imagine how to embed such frameworks +within our current zero-sum winner-takes-all economic system. There’s certainly +no shortage of heated debates around how to construct a harmonious relationship +between the open source community and companies participating in it. Something +we don’t talk about enough are the positive examples of when a coordinated +effort in open source sticks the landing and so many benefit from it. + +This post highlights the extraordinary contributions of +[Erik Anderson](https://www.linkedin.com/in/erikanderson/), +[Teng Yu](https://www.linkedin.com/in/tyu-fr/), +[Yuya Ebihara](https://www.linkedin.com/in/ebyhr/), and the broader +[Trino community](https://github.com/trinodb/trino) to finally contribute the +long-coveted +[Trino Snowflake Connector](https://trino.io/docs/current/connector/snowflake.html). + +It is both a success story and some wisdom on how corporations using and +providing services around open source should be involved in open source. It also +provides a blueprint of the patterns to follow for those who want to contribute +to open source. 
+ +## A common challenge in open source + +Despite my love for marketing and educating a community on tech (aka +[edutainment](https://en.wikipedia.org/wiki/Educational_entertainment)), it’s +only the first part of the equation of what I aim to do as a developer advocate. +Once developers see some exciting video or tutorial, they ultimately land on +the docs site, GitHub, StackOverflow, or some communication platform in the +community. It's at this point that developers can easily lose motivation if +the docs lack proper getting-started materials or, worse, the community is +silent. This is how I categorize the developer experience (aka +devex), which aims to improve both the user and contributor experiences in the +developer community by +[empowering decision making by doing](https://en.wikipedia.org/wiki/Experiential_learning), +[removing inefficiencies](https://trino.io/blog/2023/01/09/cleaning-up-the-trino-backlog), +and as we'll cover here, exposing untapped opportunities. + +Much like any open source project, maintainers on the Trino project struggle to +communicate our lack of proper resources to build and test new features that +depend on various proprietary software. For those less familiar, Trino is a federated +query engine with +[multiple data sources](https://trino.io/docs/current/connector.html). Trino +tests integrations with open data sources by running small local instances of +the connecting system. Snowflake is a proprietary cloud-native data warehouse, +also known as a cloud data platform, so we had no way to test this integration +that was +[eagerly](https://github.com/trinodb/trino/pull/2551#issuecomment-873082280) +[sought](https://github.com/trinodb/trino/issues/1863) +[by many](https://github.com/trinodb/trino/issues/7247). 
After an +[initial attempt](https://github.com/trinodb/trino/pull/2551) by my friend +[Phillipe Gagnon](https://www.linkedin.com/in/pfgagnon), a similar pattern +emerged [with the second pull request](https://github.com/trinodb/trino/pull/10387) +where the development velocity started strong and after some months stagnated. + +### Cognitive surplus and communication deficit + +One of the most unfortunate repeated issues I see in the open source projects +I've contributed to and helped maintain is that the well-known larger +objectives of the core group often move fast, while individual +updates that don't fit into a larger project narrative have a higher likelihood +of getting lost in the shuffle. As an open source project grows, you end up with a +cognitive surplus in the form of an abundance of bright people willing to share +their time, intellect, and experience with a larger community. + +Often both contributors and maintainers are so busy with their day jobs, +families, and self care, that they dedicate most of their remaining energy to +ensuring they write quality code and tests to the best of their ability. Lack of +upfront communication to validate ideas from newer contributors, and lack of +communication by maintainers who see a large number of issues to address are +two communication issues that stagnate a project. Maintainers are often doers +who see more value in addressing quick-win work that flows from the +well-established contributors of the project. Follow-through on either side can +be difficult as newcomers don't want to be rude and maintainers accidentally +forget or hope someone else will take the time to address the issues on that +pull request. + +![](https://i.snap.as/kbwKpWa7.webp) + +Waiting for your work to be reviewed by someone in the community kind of works +like a wishing well: you toss in a coin (i.e. 
your time and effort represented +as code and a pull request) and hope your wish of getting your code reviewed +and merged comes true. The satisfaction of hypothetical developer(s) that +benefit from your small and significant change floods your mind and you feel +like you’ve improved humanity just that one little bit more. + +Maintainers are in a constant state of pulling triage on all the surplus of +innovation being thrown at them and simultaneously trying to look for more help +reviewing and being the expert at some areas of the code. As you can imagine, +good communication can be hard to come by as many newcomers are strangers and +concerned they are wasting precious time by asking too many questions rather +than just showing a proof of concept. This backfires for multiple reasons that +go outside of the scope of this post. + +### History repeats itself, until it doesn't + +It became apparent that each time there was +[a discussion](https://github.com/trinodb/trino/pull/2551#issuecomment-709220790) +for how to do +[integration testing](https://github.com/trinodb/trino/pull/10387#issuecomment-1008430060) +there was no good way to test a Snowflake instance with the lack of funding for +the project. Trino has a high bar for quality and none of the maintainers felt +it was a risk worth taking due to the likely popularity of the integration and +likelihood of future maintenance issues. Once each pull request hit this same +fate, it stalled with nobody really knowing how to resolve the real issue of +limited financial resources of the +[Trino Software Foundation (TSF)](https://trino.io/foundation.html). It’s never +fun to mention that you can’t move forward on work with constraints like these +and yet nobody has the time to look into a monetary solution. + +Noticing that Teng had already done a significant amount of work to contribute +his Snowflake connector, I reached out to him to see if we could brainstorm a +solution. 
Not long after, Erik Anderson also reached out to get my thoughts on +how to go about contributing Bloomberg's Snowflake connector. Great, now we have +two connector implementations and no solution to getting the infrastructure to +get them tested. During the first +[Trino Contributor Congregation](https://trino.io/blog/2022/11/21/trino-summit-2022-recap.html#trino-contributor-congregation), +Erik and I brought up Bloomberg's desire to contribute a Snowflake connector and +I articulated the testing issue. What’s ironic is that this was the first time I had +actually thoroughly articulated the issue to Erik as well. + +As soon as I was done, Erik requested the mic and said something to the effect of, +"Oh I wish I would have known that's the problem, the solution is simple, +Bloomberg will provide the TSF a Snowflake account." + +Done! + +Just as in business, **you should never underestimate the power of communication in +an open source project**. Shortly after, Erik, Teng, and I discussed the +best ways to merge their work, set up the Snowflake accounts for Trino +maintainers, and start the arduous process of building a thorough test suite +with the help of Yuya, Piotr, Manfred, and Martin. + +## The long road to Snowflake + +As Teng and Erik merged their efforts, the process was anything but +straightforward. There were setbacks, vacations, meticulous reviews, and +infrastructure issues. But the perseverance of everyone involved was unwavering. + +Bloomberg started by creating +[an official Bloomberg Trino repository](https://github.com/bloomberg/trino) +originally as a means for Teng and Erik to mesh their solutions together and +build the testing infrastructure that relied on Bloomberg resources. Without +needing to rely on the main Trino project to merge incremental solutions, they +were able to quickly iterate on the early solutions. This repository also +facilitated Bloomberg’s now numerous contributions to Trino. 
+ +It took a few months just to get the +ForePaaS<sup>1</sup> +and Bloomberg solutions merged. There were valuable takes from each system and +better integration tests were written with the new testing infrastructure. The +two Snowflake connector implementations were merged together by April of 2023. +Finally, the reviews could start. Once the initial two passes happened, we +anticipated that we would see the Snowflake connector release in the summer of +2023 near Trino Fest. So much so, that we planned +[a talk with Erik and Teng](https://trino.io/blog/2023/07/12/trino-fest-2023-let-it-snow-recap) +initially as a reveal assuming the pull request would be merged by then. Lo and +behold, this didn’t happen, as there were still a lot of concerns around use +cases not being properly tested. + +### The Halting Review Problem + +A necessary evil that comes with pull request reviews and, more broadly, +distributed consensus is that reviews can drag on over time. This can lead to +[countless updates](https://github.com/trinodb/trino/pull/17909#issuecomment-1841809727) +you have to make to your changes to accommodate the ever-changing project +shifting beneath your feet as you simultaneously try to make progress on +[suggestions from those reviewing your code](https://github.com/trinodb/trino/pull/17909#pullrequestreview-1793724311). + +Many critics of open source like to point this out as a drawback of this model, +when in fact, this same problem exists in closed-source systems, but closed +source projects can generally delay difficult decisions to make fast upfront +progress to meet certain deadlines. This may be seen as an advantage at first, +but as many developers can attest, this simply leads to technical debt and +fragile products in most environments that struggle to prioritize a healthy +codebase. 
+ +Regardless, having to face these larger discussions upfront can induce fatigue, +especially when managing external circumstances: personal affairs, a project at +work - you know, the entity that pays these engineers - or countless other +factors will rear their ugly heads and +[progress will stagger](https://github.com/trinodb/trino/pull/17909#discussion_r1418149737) +with ebbs and flows of attention. This can be really dangerous territory and +commonly results in contributors and reviewers abandoning the PR when it stalls. + +This is why I believe open source, while not beholden to any timelines, needs a +sort of project/product management role, which is currently often covered by +project leaders and DevEx engineers. This can also relieve tension between the +needs of open source and big businesses in the community with real deadlines, at +least keeping the communication consistent while ensuring bugs and design flaws +aren’t introduced to the code base. + +## What’s in it for Bloomberg and ForePaaS? + +If you’ve never worked in open source or for a company that contributes to open +source, you may be wondering how the heck these engineers convince their +leadership to let them dump so much time into these contributions. The simple +answer is, it’s good for business. + +If we peep into why Bloomberg uses Trino, they aggregate data from an unusually +large number of data sources across their customers who use their services. Part +of this requires them to merge the customer’s dataset with existing aggregate +data in Bloomberg’s product. Since Trino can connect to most customer databases +out-of-the-box, this only requires Bloomberg to manage a small array of custom +connectors that provide their services to customers as multiple catalogs in a +single conventional SQL endpoint. Having engineers maintain a few small connectors +rather than an entire distributed query engine themselves saves a lot of time +and maintenance. 
+ +Despite how many problems Trino already solves for them, Bloomberg ideally still +wants to maintain as few things as possible to free up their engineers’ time. +Luckily, open source projects are generally more than happy to accept features +that the whole community benefits from. This doesn’t mean we shouldn’t appreciate +when companies contribute. This generosity and forward-thinking approach enabled +Erik and Teng to combine their battle-tested connectors, crafting a high-value +creation for the community. + +Yet another issue that has come up much more recently is the +[XZ exploit](https://www.darkreading.com/application-security/xz-utils-scare-exposes-hard-truths-in-software-security) +where attackers disguised as well-meaning open source contributors slowly put +a backdoor in place to enable hacking any company using this software. +Thankfully, +[backdoors only work when they are unnoticeable](https://opensourcesecurity.io/2019/08/28/backdoors-in-open-source-are-here-to-stay/) +and due to the widespread testing of open source, a Microsoft employee noticed +a performance issue that led them to find the slowly crafted exploit. Some see +this as a reason to avoid open source, while on the contrary, you’ll simply +never know how many exploits live in closed-source technology, and the number of +eyeballs able to find those exploits is far smaller. In this spirit, using and +contributing to open source becomes the safer alternative, especially as we +learn how to improve detection of exploits on public open source systems. Just +as it will seem odd that we ever relied on gas-powered vehicles one day, it will +also seem odd that we ever trusted software whose dependencies we weren’t able +to inspect. + + + +This takeaway is important for developers who want to contribute to open source: +it is imperative that developers convey a compelling story with evidence when +getting involved in open source. 
I don’t imagine a world of altruists, but I do +imagine an economic incentive model that helps us help each other much more than +in the past. + +## Esprit de Corps + +The marines use the mantra, “Esprit de Corps,” French for “spirit of the body”, +in which I mistakenly took the “Corps” part to mean the Marine Corps rather than the +more general meaning of a body or group of people. In fact, it expresses [the +common spirit existing in the members of a group and inspiring enthusiasm, +devotion, and strong regard for the honor of the +group.](https://www.merriam-webster.com/dictionary/esprit%20de%20corps) Any time +I see this type of shared and selfless cooperation in open source, I’m reminded +of the bond, friendships, and care shared between me and my fellow marines in times of war. +Despite the unfortunate political circumstances of our mission, I do treasure +the shared companionship with both my fellow marines and the local Iraqi people. +There is ultimately a power in the gathering of many when aimed at building an +altruistic means of improving each other’s lives. + +In the same way, this demonstration of human cooperation is about more than just +developing a connector; it's about the shared experiences, the friendships +forged, and the skills honed in the pursuit of a common goal. The successful +addition of the Trino Snowflake connector is a testament to the positive-sum +outcomes of open source collaboration. This journey has been about +collaboration, learning, and growth that will benefit many. + +Another effect is that once the upfront hard work like this is done, it makes +way for many valuable iterations like [adding Top-N +support](https://github.com/trinodb/trino/pull/21219) (Shoppee), [adding +Snowflake Iceberg REST catalog support](https://github.com/trinodb/trino/pull/21365) +(Starburst), and [adding better type mapping +support](https://github.com/trinodb/trino/pull/21365) (Apple) for the Snowflake +integration. 
I love showcasing this trailblazing and yes, altruistic work by Erik, +Teng, Yuya, Martin, Manfred, and Piotr - and everyone who helped in the Trino +community. A special thanks to the managers and leadership at Bloomberg and +ForePaaS for their generous commitment of time and resources. + +As we celebrate this milestone, we're already looking forward to the next +adventure. Here's to federating them all, together! + +================================================================================ +Notes: +1 +ForePaaS has been integrated into OVHCloud and renamed Data Platform. + +_bits_