How Snap rebuilt the infrastructure that now supports 347 million daily users

In 2017, 95% of Snap’s infrastructure was running on Google App Engine. Then came the Annihilate FSN project.

Snap, which launched in 2011, was built on GAE. Its original back-end system was named FSN (Feelin-So-Nice), and the majority of Snapchat’s core functionality ran within a monolithic application on GAE. The architecture was initially effective, but Snap started encountering issues when it became too big for GAE to handle, according to Jerry Hunter, senior vice president of engineering at Snap, where he runs Snapchat, Spectacles and Bitmoji as well as all back-end or cloud-based infrastructure services.

“Google App Engine wasn’t really designed to support really big implementations,” Hunter, who joined the company in late 2016 from AWS, told Protocol. “We would find bugs or scaling challenges when we were in our high-scale periods like New Year’s Eve. We would really work hard with Google to make sure that we were scaling it up appropriately, and sometimes it just would hit issues that they had not seen before, because we were scaling beyond what they had seen other customers use.”

Today, less than 1.5% of Snap’s infrastructure sits on GAE, a serverless platform for developing and hosting web applications. The company broke apart its back end into microservices backed by other services inside Google Cloud Platform (GCP) and added AWS as its second cloud computing provider. Under its multicloud model, Snap now picks which workloads to place on AWS or GCP, using the competition between the two to its advantage.

The Annihilate FSN project came with the recognition that microservices would provide a lot more reliability and control, especially from a cost and performance perspective.

“[We] basically tried to make the services be as narrow as possible and then backed by a cloud service or multiple cloud services, depending on what the service we were providing was,” Hunter said.

Snapchat now has 347 million daily active users, who send billions of short videos and photos, known as Snaps, or use its augmented-reality Lenses.

Its new architecture has resulted in a 65% reduction in compute costs, and Hunter said he has come to deeply understand the importance of having competitors in Snap’s supply chain.

“I just believe that providers work better when they’ve got real competition,” said Hunter, who left AWS as a vice president of infrastructure. “You just get better … pricing, better features, better service. We’re cloud-native, and we intend on staying that way, and it’s a big expense for us. We save a lot of money by having two clouds.” 

The Annihilate FSN process wasn’t without at least one failed hypothesis. Hunter mistakenly thought that Snap could write its applications on one layer and that layer would use the cloud provider that best fit a workload. That proved to be way too hard, he said.

“The clouds are different enough in most of their services and changing rapidly enough that it would have taken a giant team to build something like that,” he said. “And neither of the cloud providers were interested at all in us doing that, which makes sense.”

Instead, Hunter said, there are three types of services that he looks at from the cloud.

“There’s one which is cloud-agnostic,” he said. “It’s pretty much the same, regardless of where you go, like blob storage or [content-delivery networks] or raw compute on EC2 or GCP. There’s a little bit of tuning if you’re doing raw compute but, by and large, those services are all pretty much equal. Then there’s sort of mixed things where it’s mostly the same, but it really takes some engineering work to modify a service to run on one provider versus the other. And then there’s things that are very cloud-specific, where … only one cloud offers it and the other doesn’t. We have to do this process of understanding where we’re going to spend our engineering resources to make our services work on whichever cloud that it is.”

Regionalization

Snap’s current architecture also has resulted in reduced latency for Snapchatters.

In its early days, Snap had its back-end monolith hosted in a single region in the middle of the United States, in Oklahoma, which hurt performance and users’ ability to communicate instantly. If two people living a mile apart in Sydney, Australia, were sending Snaps to each other, for example, the video would have to traverse Australia’s terrestrial network and an undersea cable to the United States, be deposited in a server in Oklahoma and then backtrack to Australia.

“If you and I are in a conversation with each other, and it’s taking seconds or half a minute for that to happen, you’re out of the conversation,” Hunter said. “You might come back to it later, but you’ve missed that opportunity to communicate with a friend. Alternatively, if I have just the messaging stack sitting inside of the data center in Sydney … now you’re traversing two miles of terrestrial cable to a data center that’s practically right next to you, and the entire transaction is so much faster.”
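The cost of that detour can be roughed out with a little arithmetic. The sketch below is only an illustration, not anything Snap has published: it assumes signals travel through fiber at about 200,000 km per second, uses approximate coordinates for Sydney and for a point in Oklahoma (the article does not name the exact site), and measures straight great-circle distance, which real cable routes exceed. It computes only the minimum propagation delay for a path; the delays users actually saw would be far larger, since large media uploads, multiple protocol round trips and server processing all add time.

```python
import math

EARTH_RADIUS_KM = 6371.0
FIBER_KM_PER_S = 200_000.0  # assumption: roughly two-thirds of light speed in a vacuum


def great_circle_km(lat1, lon1, lat2, lon2):
    """Haversine distance between two points given in degrees."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def min_propagation_ms(path_km):
    """Lower bound on the time for a signal to cover path_km of fiber."""
    return path_km / FIBER_KM_PER_S * 1000.0


SYDNEY = (-33.87, 151.21)   # approximate
OKLAHOMA = (35.5, -97.5)    # approximate point in Oklahoma; exact site is an assumption

hop_km = great_circle_km(*SYDNEY, *OKLAHOMA)

# Sender -> Oklahoma -> recipient: the Snap crosses the Pacific twice.
via_oklahoma_ms = min_propagation_ms(2 * hop_km)

# Sender -> Sydney data center -> recipient: about two miles of cable in total.
local_ms = min_propagation_ms(2 * 1.609)

print(f"one hop: {hop_km:.0f} km")
print(f"via Oklahoma: at least {via_oklahoma_ms:.1f} ms per traversal")
print(f"via Sydney:   at least {local_ms:.3f} ms per traversal")
```

Every extra round trip a protocol needs multiplies the long-path figure, which is one reason moving a chatty service such as messaging close to its users pays off more than the raw distance suggests.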

Snap wanted to regionalize its services where it made sense. The only way to do that was by using microservices and understanding which services were useful to have close to the customer and which ones weren’t, Hunter said.

“Customers benefit by having data centers be physically closer to them because performance is better,” he said. “CDNs can cover a lot of the broadcast content, but when doing one-on-one communications with people — people send Snaps and Snap videos — those are big chunks of data to move through the network.”

That ability to switch regions is one of the benefits of using cloud providers, Hunter said.

“If I want to experiment and move something to Sydney or Singapore or Tokyo, I can just do it,” he said. “I’m just going to call them up and say, ‘OK, we’re going to put our messaging stack in Tokyo,’ and the systems are all there, and we try it. If it turns out it doesn’t actually make a difference, we turn that service off and move it to a cheaper location.”

Delta Force

Snap has built more than 100 services for very specific functions, including Delta Force. 

In 2016, any time a user opened the Snapchat app, it would download or redownload everything, including stories that the user had already viewed but that hadn’t yet expired in the app.

“It was … a naive deployment of just ‘download everything so that you don’t miss anything,’” Hunter said. “Delta Force goes and looks at the client … finds out all the things that you’ve already downloaded and are still on your phone, and then only downloads the things that are net-new.”

This approach had other benefits.

“Of course, that turns out to make the app faster,” Hunter said. “It also costs us way less, so we reduced our costs enormously by implementing that single service.”
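The idea Hunter describes, sending only what the client does not already hold, can be sketched in a few lines. This is a minimal illustration under stated assumptions, not Snap’s implementation: the `Story` type, the `delta_sync` and `naive_sync` functions, and the byte sizes are all hypothetical names and numbers invented for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Story:
    story_id: str
    size_bytes: int  # hypothetical payload size


def naive_sync(server_stories):
    """Pre-Delta-Force behavior as described: send everything on every open."""
    return list(server_stories)


def delta_sync(server_stories, client_cached_ids):
    """Send only the stories whose IDs the client does not report holding."""
    cached = set(client_cached_ids)
    return [s for s in server_stories if s.story_id not in cached]


# Hypothetical feed and client cache.
feed = [Story("a", 4_000_000), Story("b", 2_500_000), Story("c", 3_000_000)]
on_phone = ["a", "b"]

to_send = delta_sync(feed, on_phone)
naive_bytes = sum(s.size_bytes for s in naive_sync(feed))
delta_bytes = sum(s.size_bytes for s in to_send)

print([s.story_id for s in to_send])  # ['c']
print(naive_bytes, delta_bytes)       # 9500000 3000000
```

In this toy case the delta path transfers less than a third of the bytes, which shows how the same change could make the app faster and reduce serving costs at once.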

Open source

Snap uses open-source software to create its infrastructure, including Kubernetes for service development, Spinnaker for its application team to deploy software, Spark for data processing and memcached/KeyDB for caching. “We have a process for looking at open source and making sure we’re comfortable that it’s safe and that it’s not something that we wouldn’t want to deploy in our infrastructure,” Hunter said.

Snap also uses Envoy, an edge and service proxy and universal data plane designed for large, microservice service-mesh architectures.

“I actually feel like … the way of the future is using a service mesh on top of your cloud to basically deploy all your security protocols and make sure that you’ve got the right logins and that people aren’t getting access to it that shouldn’t,” Hunter said. “I’m happy with the Envoy implementations giving us a great way of managing load when we’re moving between clouds.”

Cloud primitives, ‘moving fast’ and cost camp

Hunter prefers using primitives or simple services from AWS and Google Cloud rather than managed services. A Snap philosophy that serves it well is the ability to move very fast, Hunter said.

“I don’t expect my engineers to come back with perfectly efficient systems when we’re launching a new feature that has a service as a back end,” he said, noting many of his team members previously worked for Google or Amazon. “Do what you have to do to get it out there, let’s move fast. Be smart, but don’t spend a lot of time tuning and optimizing. If that service doesn’t take off, and it doesn’t get a lot of use, then leave it the way it is. If that service takes off, and we start to get a lot of use on it, then let’s go back and start to tune it.”

That tuning process, which depends on understanding how a service operates, is where cycles of cloud usage can be cut for immediate cost savings, according to Hunter.

“Our total compute cost is so large that little bits of tuning can have really large amounts of cost savings for us,” he said. “If you’re not making the sort of constant changes that we are, I think it’s fine to use the managed services that Google or Amazon provide. But if you’re in a world where we’re constantly making changes — like daily changes, multiple-times-a-day changes — I think you want to have that technical expertise in house so that you can just really be on top of things.”

Three factors figure into Snap’s ability to reap cost savings: the competition between AWS and Google Cloud, Snap’s ability to tweeze out costs through its own work, and its practice of going back to the cloud providers to look at their new products and services.

“We’re in a state of doing those three things all the time, and between those three, [we save] many tens of millions of dollars,” Hunter said.

Snap holds a “cost camp” every year where it asks its engineers to find all the places where costs possibly could be reduced.

“We take that list and prioritize that list, and then I cut people loose to go and work on those things,” he said. “On an annual basis depending on the year, it’s many tens of millions of dollars of cost savings.”

Adding a third cloud provider and advice on going multicloud 

Snap has considered adding a third cloud provider, and it could still happen some day, although the process is pretty challenging, according to Hunter.

“It’s a big lift to move into another cloud, because you’ve got those three layers,” he said. “The agnostic stuff is pretty straightforward, but then once you get to mixed and cloud-specific, you’ve got to go hire engineers that are good at that cloud, or you’ve got to go train your team up on … the nuances of that cloud.”

Enterprises considering adding another cloud provider need to make sure they have the engineering staff to pull it off: 20 to 30 dedicated cloud people as a starting point, Hunter said.

“It’s not cheap, and second, that team has to be pretty sophisticated and technical,” he said. “If you don’t have a big deployment, it’s probably not worth it. I think about a lot of the customers I used to serve when I was in AWS, and the vast majority of them, their implementations … were serving their company’s internal stuff, and it wasn’t gigantic. If you’re in that boat, it’s probably not worth the extra work that it takes to do multicloud.”
