Anthony Lewis: The Sprout Data Pipeline: NSQ and Beyond

Thursday, August 4, 2016

The Sprout Data Pipeline: NSQ and Beyond

One engineering challenge we face at Sprout Social is the diverse ways our customers want to access their social data. Social media is fast-moving and real-time but also requires long-term insights via reports and analytics. For example, a global hotel chain might use Sprout's Smart Inbox to enable each of its locations to interact with local customers while at the same time empowering its social media manager to track trends and macro impact over time via Sprout Reports. These use cases have very different technical design considerations that Sprout engineers work on every day.

The Sprout Smart Inbox is a “soft real-time” system, whereas our social reports are produced in batches. Each of these stacks is unique in its technology. The Sprout reporting stack is made up of mostly big aggregate metrics that are generated from Hadoop-related tools, namely HBase, Map/Reduce and Pig. The Smart Inbox is powered primarily by finely tuned MySQL databases coupled with Cassandra for our bigger message indices.

Both of these areas of our application have different querying requirements, yet each system needs to be sourced from the same upstream data. In order to power these diverse indexing and querying needs, we built a data pipeline to service all these different backend systems. The key component that glues all of these systems together is our NSQ-powered message bus.

The Case for NSQ

When we looked at Sprout's data distribution needs, we had the following goals:

No central points of failure. Because this system feeds data to all of our core Sprout services, any centralized technology puts the entire platform at risk.

Independent, on-demand access to data. Engineering teams should be able to consume data they need à la carte, independently of other squads.

Monitoring and observability “for free.” By consuming or producing data on the system, you should get monitoring and observability “for free.” The NSQ HTTP API allows us observe, alert and collect statistics on all producers and consumers in the topology.

Data replay for backfilling data. This includes extreme cases such as disaster recovery and total datacenter loss. This gives us an extra layer of redundancy on top of our standard database backup procedures.

With these goals in mind, we evaluated several different competing messaging technologies and narrowed it down to two finalists:

Apache Kafka

It's important that whenever you evaluate any core piece of technology, you're honest about the real engineering tradeoffs involved. At Sprout, we ask ourselves several key questions when adopting a new technology:

Do we already have a tool that can do the job?

How closely does the technology match our business needs?

How much operational effort is required to support the system?

How mature is the ecosystem?

Kafka and NSQ both meet similar messaging needs, but with some important differences:

Kafka distributes its brokers, but still lends itself to shared centralized Kafka clusters. NSQ cleanly separates control plane (nsqlookupd) from data plane (nsqd) and allows you to deploy your messaging topology in more flexible configurations.

Kafka is a log-based messaging system, whereas NSQ is more of a distributed queueing system. If you need replay functionality in NSQ, you'll have to build it yourself (more on this later).

NSQ ships with key operational tooling out of the box, such as the nsqadmin web control panel, a built-in HTTP API for monitoring each nsqd instance, and other useful utilities such as nsq_tail and nsq_to_file. Because of its Go implementation, it ships as a statically linked standalone binary and is very simple to configure.

NSQ strikes a great balance for Sprout's data distribution use cases. We can strategically place queue storage daemons (nsqd) locally with our data producer services in order to increase reliability and prevent message loss in cases of network partitions. For example, our service that consumes real-time events from Twitter can buffer social messages to an instance of an nsqd that's running on the same server. This is a key feature that would be extremely difficult to achieve had we chosen Kafka for our core messaging system.

Running NSQ in Production

We've been running NSQ in production for over a year now, and we couldn't be happier with the results. NSQ leaves a small system footprint, keeps up with our workloads well and is remarkably stable.

Some other noteworthy stats from our current running production topology:

Our global message production rate averages between 1,000-3,000 messages per second.

Peak message rates can spike as high as 15,000 messages per second with no noticeable strain on the system.

Average channel fanout multiple is roughly 7x, meaning that for every message we produce, we average roughly 7 different services that consume that message.

Highest channel fanout multiple is 95x. This is a topic that fires an event whenever a customer attaches a new social profile to the Sprout platform.

15 nsqd processes, coordinated with a single 3 node lookup cluster. About half of these nsqd instances run server local, and the other half run as highly available centralized pairs.

87 unique topics. Most topics feed direct customer features, but many topics feed offline analytics processes for our data science team.

More than 37 billion messages produced with the system over the past year. The reliable developer self-service model of the system, coupled with organic customer data growth, continues to push these numbers ever upwards.

Tooling on Top of NSQ

In a future blog post, we will outline some other key features that we built on top of NSQ in order to make it production-ready for the Sprout platform: