Reliable Event Delivery System at Spotify

Basic info: Spotify clients generate up to 1.5 million events per second at peak hours, and all of them are handled by their Event Delivery System, which is designed to have predictable latency and to never lose an event.

The presentation is limited to QCon London attendees, so I had to read through Spotify's blog to find the information I needed (https://labs.spotify.com/2016/02/25/spotifys-event-delivery-the-road-to-the-cloud-part-i/).

If you don’t know about Kafka, read the following post:

https://sookocheff.com/post/kafka/kafka-in-a-nutshell/

(It’s like a mix of Kinesis and SQS)

Season 1: The Crappy Old Way

Their old system design:

[Image: old system design]

It is dumb that these Hadoop guys created so many nonsense terms. A Kafka Broker is just a Kafka node.

Kafka 0.7 surprisingly doesn’t support persistence on the Kafka Dumb Brokers, which means that until the events are persisted on HDFS, they (Spotify, if you ask the even dumber question ‘who are they’) cannot claim that the data is persisted.

This is so dumb, and they (Spotify!!!!) again created a stupid solution: the “Liveness Monitor” first reads from HDFS to confirm that the data is persisted, then it notifies the producer that the data is persisted, then the producer produces a termination (EOF) signal and sends it to Kafka, and Kafka forwards it to HDFS. Only at this point can ETL say the job is finished and load the data into some internal datastore for future processing.
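To make that handshake concrete, here is a minimal sketch of the flow as I understand it. All the names (hdfs_contains, liveness_monitor, the EOF payload) are made up for illustration; this is not Spotify's actual code.

```python
# Hypothetical sketch of the Season 1 handshake; function names and the EOF
# payload are my own inventions, not Spotify's real components.
import time

def hdfs_contains(event_id):
    """Stub: has this event landed on HDFS yet?"""
    raise NotImplementedError

def liveness_monitor(produced_event_ids, notify_producer, poll_seconds=5):
    """Poll HDFS until every produced event is visible, then tell the producer."""
    pending = set(produced_event_ids)
    while pending:
        time.sleep(poll_seconds)
        pending = {e for e in pending if not hdfs_contains(e)}
    notify_producer()  # "your data is safely on HDFS"

def on_persistence_ack(send_to_kafka):
    """Producer reaction: emit the EOF marker so ETL knows the bucket is complete."""
    send_to_kafka({"type": "EOF"})
    # If the producer dies before this line runs, the EOF never reaches HDFS and
    # the bucket stays half-finished; that is the data-loss window noted below.
```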

Yes. This is what Spotify was doing, in season one.

When a producer becomes unavailable while events are only halfway persisted to HDFS (i.e. no EOF is ever sent), data loss may occur.

It is easy to ask tough questions, and there are definitely defects in the system. However, we need to remember that we live in an imperfect world and we accept well-calculated risks; we are software engineers.

(I feel disgusted when people keep asking what-if questions during the system design phase. This is an important lesson I learned from my graduate program. You should always say “If … then we will screw … up”, because that poses a real question/problem to solve, instead of just trying to make yourself sound smarter.)

Season 2: The Fancy Cloud (Which Gets Engineers Laid) (No, Laid Off)

What they had in mind:

[Image: new system design]

What they tried but failed at miserably:

[Image: Gabo Kafka system design]

What they ended up with:

[Image: Gabo system design]

Quoting the Spotify blog on what went wrong (the “we” below is Spotify, not me):

Mirror Maker gave us the most headaches. We assumed it would reliably mirror data between data centers, but this simply wasn’t the case. It only mirrored data on a best effort basis. If there were issues with the destination cluster, the Mirror Makers would just drop data while reporting to the source cluster that data had been successfully mirrored. (Note that this behaviour should be fixed in Kafka 0.9.)
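To see why “best effort” here means data loss, consider an at-least-once mirror loop sketched with the kafka-python client (broker addresses and topic name are made up; this is not MirrorMaker’s actual code). The source offset must only be committed after the destination broker acknowledges the write, which is exactly the step the buggy Mirror Makers skipped.

```python
# Sketch of an at-least-once mirror loop (not MirrorMaker's real code).
# Broker addresses and the topic name are made up.
from kafka import KafkaConsumer, KafkaProducer

consumer = KafkaConsumer(
    "events",
    bootstrap_servers="source-dc:9092",
    group_id="mirror",
    enable_auto_commit=False,   # auto-committing here is the "best effort" trap
)
producer = KafkaProducer(bootstrap_servers="destination-dc:9092", acks="all")

for record in consumer:
    future = producer.send("events", record.value)
    future.get(timeout=30)      # block until the destination broker acks;
                                # raises on failure instead of silently dropping
    consumer.commit()           # only now is it safe to advance the source offset
```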

Mirror Makers occasionally got confused about who was the leader for consumption. The leader would sometimes forget that it was a leader, while the other Mirror Makers from the cluster would happily still try to follow it. When this happened, mirroring between data centers would stop.

The Kafka Producer also had serious issues with stability. If one or more brokers from a cluster was removed, or even just restarted, it was quite likely that the producer would enter a state from which it couldn’t recover by itself. While it was in such a state it wouldn’t produce any events. The only solution was to restart the whole service.
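Since the only fix was restarting the service, about all you can automate is a watchdog along these lines. This is entirely my speculation, with a hypothetical service name; Spotify’s actual tooling is not described.

```python
# Crude watchdog sketch (my speculation, not Spotify's tooling). If the producer
# has made no progress for a while, assume it is wedged and restart the service.
import subprocess
import time

STALL_SECONDS = 120
last_progress = time.time()  # a real producer would update this after each successful send

def check_and_restart():
    if time.time() - last_progress > STALL_SECONDS:
        # "event-producer" is a hypothetical systemd unit name.
        subprocess.run(["systemctl", "restart", "event-producer"], check=True)
```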

This is just another proof that the open-source movement cannot solve every problem, because it doesn’t have enough experts in distributed systems. Such a cheap knock-off.

You may want to argue that this software will evolve and eventually fix the issues. However, look at MongoDB, Redis, Riak… LOL.

They then load-tested Google Cloud Pub/Sub as the replacement. Quoting again:

(For publish) To account for the future growth and possible disaster recovery scenarios, we settled on a test load of 2M events per second.

(For subscription) For the duration of the test we published, on average, around 800K messages per second. To mimic real world load variations, the publishing rate varied according to the time of day.

Based on these tests, we felt confident that Cloud Pub/Sub was the right choice for us. Latency was low and consistent, and the only capacity limitation we encountered was the one explicitly set by the available quota. In short, choosing Cloud Pub/Sub rather than Kafka 0.8 for our new event delivery platform was an obvious choice.
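For a sense of what the publish side looks like on Cloud Pub/Sub, here is a minimal sketch with the Python client; the project and topic names are made up.

```python
# Minimal Cloud Pub/Sub publish sketch; project and topic names are made up.
from google.cloud import pubsub_v1

publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path("my-project", "events")

# publish() batches messages in the background and returns a future per message.
future = publisher.publish(topic_path, data=b'{"event": "song_played"}')
print(future.result())  # blocks until the service acks; returns the message ID
```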

So problem solved.

LOL.

A “system design”. So sad.