Scalying New Heights: Dealing with Hyper Growth, Kubernetes, and Getting the Goals Right at Robinhood

Robinhood is an online trading platform on a mission to democratize finance and make it possible for everyone to participate in the financial system. The company has experienced explosive growth in the past year, along with the growing pains that go with it, and has learned some important lessons along the way. Adam Wolff spent two years as the VP of Engineering. He is also a member of ENG, the peer-network of VPEs and CTOs at leading SaaS companies. Adam joined me for an interview on Scalying New Heights, which you can watch at the link below. A few highlights follow.

https://www.youtube.com/watch?v=NZLevav5FZU

First, a bit of context. As online trading and cryptocurrency explode, Robinhood has enjoyed a huge increase in new user signups plus expansion of trading on the platform, and this has affected every part of the business and the underlying infrastructure. As the number of new users grew quickly, along with trading volumes, the engineering team needed to be sure they were hitting the new peaks during their own load tests, rather than for the first time in live production traffic. Adam explains it this way:

Scaling is about building teams, tools and processes that can continually handle new highs.

If you’ve ever tried to build these kinds of tools and these kinds of frameworks, you know it’s a non-trivial engineering task.  As with most things in business, leadership is key, and leaders themselves are continually challenged to set appropriate goals, push teams to change or adopt new technology, and find the right balance between innovation and optimization. Adam’s experience is emblematic of what happens when the business objectives and the tools to achieve them get mixed up.

When Adam joined Robinhood there was a new project to implement, Kubernetes, and at first he thought that sounded like a good idea. “The business objective was simply to move to Kubernetes.” He had come from Facebook, where they had sophisticated, high-quality frameworks for deploying different kinds of jobs in a very abstract way. You didn’t have to think about it; you could deploy new code and it just worked. The marketing materials for Kubernetes, and the demo, made it look like exactly what they needed, so he jumped on board. Years later, they have made a lot of progress, but the migration is still under way and has proved more difficult than expected. In this respect, they are not alone.

When you deploy a technology like Kubernetes, it’s not just one team that’s impacted, it’s every team — infrastructure platform, security, product, and so forth. “When you frame the goal in terms of the technology, you leave room for interpretation by all of these teams, and each one of these teams will bring their own goals, desires, and fantasies to the project.” Without a clear and unified business objective, the Kubernetes project had no unified, defined direction. The real goal was to help product teams deploy software quickly. THAT was the objective.

“In retrospect, it’s pretty easy to say that management and leadership — myself — made a mistake in saying that we wanted Kubernetes. Because Kubernetes is not what I want. I actually have no real opinion about the technology that we use, what I want is faster release velocity.”

Ultimately what I really want is immutable infrastructure

“I want to get to the point where it’s easier and faster to deploy a new build with the fix, than it is to go back and try to change anything.”

When setting the direction for another major investment in scalability, Adam wanted to avoid the mistakes he felt were made with Kubernetes. A financial services company like Robinhood will have a variety of offerings for customers, such as a crypto service, an equities trading service, an options trading service, etc. The classic way is to scale these things independently. So the options trading service would be sharded against multiple databases, and the crypto service would separately be sharded against multiple databases. But most things at Robinhood are account-based, so this scaling strategy does not work while also allowing the company to create a network effect among users. A Robinhood financial account is, by design, something that’s only accessed by one or maybe a couple of users, so it’s desirable to put one account in one shard. This allows separate crypto, options, and equity services, with a controller that knows what the overall sharding is for all the accounts, while the individual services are blissfully unaware. That limits the number of interservice connections and the number of hosts that need to talk to each other.
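To make the account-based idea concrete, here is a minimal sketch of a controller that maps each account to one shard, with every product service for that account living in the same shard. All names here (ShardController, Service, build_cells, route) are hypothetical, and assigning accounts with a hash modulo the shard count is an assumption made for illustration; the interview does not describe how Robinhood actually assigns accounts.

```python
import hashlib


class ShardController:
    """Hypothetical controller: the only component that knows the account-to-shard mapping."""

    def __init__(self, num_shards: int) -> None:
        if num_shards <= 0:
            raise ValueError("num_shards must be positive")
        self.num_shards = num_shards

    def shard_for(self, account_id: str) -> int:
        # Assumption: a stable hash modulo the shard count. Using hashlib
        # (not the built-in hash()) keeps the result identical across processes.
        digest = hashlib.sha256(account_id.encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") % self.num_shards


class Service:
    """A product service instance (crypto, options, equities) bound to one shard.

    It stores data for whatever accounts it is handed and never consults
    the overall sharding layout.
    """

    def __init__(self, name: str, shard: int) -> None:
        self.name = name
        self.shard = shard
        self.records: dict[str, list[str]] = {}

    def record(self, account_id: str, item: str) -> None:
        self.records.setdefault(account_id, []).append(item)


def build_cells(controller: ShardController, service_names: list[str]) -> list[dict[str, Service]]:
    # One full set of services per shard, so an account's data stays together.
    return [
        {name: Service(name, shard) for name in service_names}
        for shard in range(controller.num_shards)
    ]


def route(controller: ShardController, cells: list[dict[str, Service]],
          account_id: str, service_name: str) -> Service:
    # Only the routing layer asks the controller where an account lives.
    return cells[controller.shard_for(account_id)][service_name]
```

In this sketch, a request for account "acct-1" reaches the crypto and options services in the same shard, so those services only ever talk to neighbors in their own shard rather than to every database in the fleet.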

What we found, experiencing this hyper growth, is that the second order effects get you.

“You can almost always find a way to scale something by adding more machines, or tuning a database query or something like that.  But ultimately they may need to access the service location system, or Kafka, and it’s these connections that ultimately cause problems.”

Robinhood’s new cell-based architecture increases fault isolation and reduces several of the scaling pain points. The ultimate goal was to achieve a flexible architecture that allowed the company to release reliable code, quickly. Framing the goal in this way allowed measurable milestones, such as achieving fewer connections among hosts, or containing the blast radius when there are production issues. Framing the goals in the right way provides direction for the team and frames every conversation within and among teams, and creates alignment in a more natural way.
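A rough back-of-the-envelope sketch shows why milestones like "fewer connections among hosts" and "contained blast radius" are measurable. It assumes a worst case in which every host may connect to every other host it can reach, and equally sized cells with no cross-cell traffic. The function names and the host counts in the usage note are illustrative assumptions, not Robinhood figures.

```python
def full_mesh_connections(hosts: int) -> int:
    """Number of distinct host pairs when every host may talk to every other."""
    if hosts < 0:
        raise ValueError("hosts must be non-negative")
    return hosts * (hosts - 1) // 2


def cell_connections(hosts: int, cells: int) -> int:
    """Host pairs when hosts are split evenly into cells that only talk internally."""
    if cells <= 0 or hosts % cells != 0:
        raise ValueError("hosts must divide evenly into a positive number of cells")
    return cells * full_mesh_connections(hosts // cells)


def blast_radius(cells: int) -> float:
    """Fraction of accounts affected if exactly one cell fails (equal-sized cells)."""
    if cells <= 0:
        raise ValueError("cells must be positive")
    return 1 / cells
```

With a hypothetical 1,000 hosts, full_mesh_connections(1000) gives 499,500 possible pairs, while cell_connections(1000, 10) gives 49,500, and blast_radius(10) is 0.1. The exact numbers matter less than the fact that each can be tracked as a milestone.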

“We need to set goals that truly advance the cause of the business, not of engineering itself. I often get caught up in the technical details of, or the coolness of, a new solution. It’s really important for leadership to zoom out and ask, what are we trying to do? And maybe even more importantly, how will we know if we’re doing it? And when we set goals in these terms they have a very different flavor and tenor, and they leave a different kind of room for the engineering team to go pursue them.”

Many of us have spent our whole careers in the tech industry because we love tech. It’s easy to get excited by the next new technology. But as leaders, it’s important we think about the business goals first, and then the best ways to achieve them. You may want to use your own resources, build everything from scratch and customize the heck out of it, and then maintain and develop it yourself. Or, you may want to take something off the shelf and just move fast, focusing your own resources on things that are central to the business. This is always the dilemma, and the path through it begins with framing up the results you expect, versus the path to get there.

You know, this is a lesson I’ve learned again and again.

About the Author

Christine Heckart is CEO of Scalyr, which provides the industry’s most scaled log analytics SaaS offering and a unique Event Data Cloud, a DPaaS that delivers analytics as a service for event data and can be integrated with existing dashboards, user interfaces, and custom applications. Scalyr also created and curates a peer-network of VPEs, CTOs, and top technical executives at leading SaaS companies called ENG (Engage, Network, Grow). To learn more about Scalyr or to join ENG, visit www.scalyr.com.