Cloud Foundry (CF) is a platform-as-a-service that, once deployed, makes it easy for developers to deploy, run, and scale web applications. Powering this elegant PaaS is a complex distributed system comprised of several interoperating components: the Cloud Controller (CC) accepts user input and directs Droplet Execution Agents (DEAs) to stage and run web applications. Meanwhile, the Router maps inbound traffic to web-app instances, while the Loggregator streams log output back to developers. All these components communicate via NATS, a performant message bus.
It’s possible to boil CF down to a relatively simple mental model: users inform the CC that they desire applications and Cloud Foundry ensures that those applications, and only those applications, are actually running on DEAs. In an ideal world, these two states – the desired state and the actual state – match up. Of course, distributed systems are not ideal worlds and we can – at best – ensure that the desired and actual states come into eventual agreement. Ensuring this eventual consistency is the job of yet another CF component: the Health Manager (HM).
In terms of our simple mental model, HM’s job is easy to express:
- collect the desired state of the world (from the CC via HTTP)
- collect the actual state (from the DEAs via application heartbeats over NATS)
- perform a set diff to find discrepancies – e.g. missing apps or extra (rogue) apps
- send START and STOP messages to resolve these discrepancies (a rough sketch in code follows)
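In code, the heart of that loop boils down to a set diff between desired and actual instances. The sketch below is purely illustrative; the `Instance` type and `diff` function are hypothetical stand-ins under stated assumptions, not HM's actual implementation:

```go
package reconcile

import "fmt"

// Instance identifies a single desired or running app instance.
// These types are illustrative, not HM's real data structures.
type Instance struct {
	AppGUID string
	Index   int
}

func key(i Instance) string { return fmt.Sprintf("%s/%d", i.AppGUID, i.Index) }

// diff returns the instances that should be started (desired but not running)
// and the rogue instances that should be stopped (running but not desired).
func diff(desired, actual []Instance) (toStart, toStop []Instance) {
	desiredSet := map[string]bool{}
	actualSet := map[string]bool{}
	for _, d := range desired {
		desiredSet[key(d)] = true
	}
	for _, a := range actual {
		actualSet[key(a)] = true
	}
	for _, d := range desired {
		if !actualSet[key(d)] {
			toStart = append(toStart, d) // missing: send START
		}
	}
	for _, a := range actual {
		if !desiredSet[key(a)] {
			toStop = append(toStop, a) // rogue: send STOP
		}
	}
	return toStart, toStop
}
```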
Despite the relative simplicity of its domain, the Health Manager has
been a source of pain and uncertainty in our production environment. In
particular, there is only one instance of HM running. Single points of
failure are A Bad Thing in distributed systems!
The decision was made to rewrite HM. This was a strategic choice that went beyond HM – it was a learning opportunity. We wanted to learn how to rewrite CF components in place, we wanted to try out new technologies (in particular: Golang and etcd), and we wanted to explore different patterns in distributed computing.
HM9000 is the result of this effort and we’re happy to announce that it is ready for prime time: for the past three months, HM9000 has been managing the health of Pivotal’s production AWS deployment at run.pivotal.io!
As part of HM9000’s launch we’d like to share some of the lessons we’ve learned along the way. These lessons have informed HM9000’s design and will continue to inform our efforts as we modernize other CF components.
Laying the Groundwork for a Successful Rewrite
Replacing a component within a complex distributed system must be
done with care. Our goal was to make HM9000 a drop-in replacement by
providing the same external interface as HM. We began the rewrite effort
by steeping ourselves in the legacy HM codebase. This allowed us to
clearly articulate the responsibilities of HM and generate
tracker stories for the rewrite.
Our analysis of HM also guided our design decisions for HM9000. HM is written in Ruby and contains a healthy mix of intrinsic and incidental complexity. Picking apart the incidental complexity and identifying the true, intrinsic complexity of HM’s problem domain helped us understand which problems were difficult and which pitfalls to avoid. An example might best elucidate how this played out:
While HM’s responsibilities are relatively easy to describe (see the list enumerated above), there are all sorts of insidious race conditions hiding in plain sight. For example, it takes time for a desired app to actually start running. If HM updates its knowledge of the desired state before the app has started running, it will see a missing app and attempt to START it. This is a mistake that can result in duplicates of the app running; so HM must instead wait for a period of time before sending the START message. If the app starts running in the intervening time, the pending START message must be invalidated.
These sorts of race conditions are particularly thorny: they are hard to test (as they involve timing and ordering) and they are hard to reason about. Looking at the HM codebase, we also saw that – if not correctly abstracted – they have a habit of leaking all over and complicating the code.
Many Small Things
To avoid this sort of cross-cutting complexity, we decided to design HM9000 around a simple principle: build lots of simple components that each do One Thing Well.
HM9000 is comprised of 8 distinct components. Each of these
components is a separate Go package that runs as a separate process on
the HM9000 box. This separation of concerns forces us to have clean
interfaces between the components and to think carefully about how
information flows from one concern to the next.
For example, one component is responsible for periodically fetching the desired state from the CC. Another component listens for app heartbeats over NATS and maintains the actual state of the world. All the complexity around maintaining and trusting our knowledge of the world resides only in these components. Another component, called the analyzer, is responsible for analyzing the state of the world: this is the component that performs the set-diff and makes decisions. Yet another component, called the sender, is responsible for taking these decisions and sending START and STOP messages.
Each of these components has a clearly defined domain and is
separately unit tested. HM9000 has an integration suite responsible for
verifying that the components interoperate correctly.
This separation of components also forces us to think in terms of distributed systems: each mini-component must be robust to the failure of any other mini-component, and the responsibilities of a given mini-component can be moved to a different CF component if need be. For example, the component that fetches desired state could, in principle, move into the CC itself.
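To give a flavor of what "many small things" looks like in practice, here is a hedged sketch of how a desired-state fetcher component might be shaped: a standalone process that polls the CC on a ticker and hands what it finds to a shared store. The `Store` and `CCClient` interfaces and the `DesiredAppState` type are hypothetical stand-ins, not HM9000's real packages.

```go
package fetcher

import (
	"log"
	"time"
)

// DesiredAppState is an illustrative model of what the CC reports.
type DesiredAppState struct {
	AppGUID      string
	NumInstances int
}

// Store and CCClient are hypothetical interfaces; in HM9000 the store
// abstraction lives in a shared library used by all components.
type Store interface {
	SaveDesiredState([]DesiredAppState) error
}

type CCClient interface {
	FetchDesiredState() ([]DesiredAppState, error)
}

// Fetch runs forever, polling the CC and syncing the store. A failure on one
// iteration is logged and retried on the next tick, so a hiccup in the CC or
// the store does not take the component down.
func Fetch(cc CCClient, store Store, interval time.Duration) {
	for range time.Tick(interval) {
		desired, err := cc.FetchDesiredState()
		if err != nil {
			log.Printf("failed to fetch desired state: %v", err)
			continue
		}
		if err := store.SaveDesiredState(desired); err != nil {
			log.Printf("failed to save desired state: %v", err)
		}
	}
}
```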
Communicating Across Components
But how to communicate and coordinate between components? CF’s answer
to this problem is to use a message bus, but there are problems with
this approach: what if a message is dropped on the floor? How do you
represent the state of the world at a given moment in time with
messages?
With HM9000 we decided to explore a different approach: one that might lay the groundwork for a future direction for CF’s inter-component communication problem. We decided, after extensive benchmarking, to use etcd, a high-availability, hierarchical key-value store written in Go.
The idea is simple: rather than communicating via messages, the HM9000 components coordinate on data in etcd. The component that fetches
desired state simply updates the desired state in the store. The
component that listens for app heartbeats simply updates the actual
state in the store. The analyzer performs the set diff by querying the
actual state and desired state and placing decisions in the store. The
sender sends START and STOP messages by simply acting on these
decisions.
To ensure that HM9000’s various components use the same schema, we
built a separate ORM-like library on top of the store. This allows the
components to speak in terms of semantic models and abstracts away the
details of the persistence layer. In a sense, this library forms the API
by which components communicate. Having this separation was crucial –
it helped us DRY up our code, and gave us one point of entry to change
and version the persisted schema.
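To give a sense of what such a library looks like, here is a hedged sketch: a model with a canonical store key and JSON representation, so every component reads and writes the same bytes under the same path. The names (`ActualInstanceHeartbeat`, `StoreKey`) and the key layout are illustrative assumptions, not HM9000's actual API.

```go
package models

import (
	"encoding/json"
	"fmt"
)

// ActualInstanceHeartbeat is an illustrative model of one piece of the
// actual state: a heartbeat from a single app instance on a DEA.
type ActualInstanceHeartbeat struct {
	AppGUID      string `json:"app_guid"`
	InstanceGUID string `json:"instance_guid"`
	Index        int    `json:"index"`
	State        string `json:"state"` // e.g. "RUNNING", "CRASHED"
}

// StoreKey is the single place where the key schema is defined; every
// component goes through it, so the layout can be changed or versioned
// in one spot.
func (h ActualInstanceHeartbeat) StoreKey() string {
	return fmt.Sprintf("/v1/actual/%s/%s", h.AppGUID, h.InstanceGUID)
}

// ToJSON and HeartbeatFromJSON define the canonical format stored in etcd.
func (h ActualInstanceHeartbeat) ToJSON() ([]byte, error) {
	return json.Marshal(h)
}

func HeartbeatFromJSON(data []byte) (ActualInstanceHeartbeat, error) {
	var h ActualInstanceHeartbeat
	err := json.Unmarshal(data, &h)
	return h, err
}
```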
Time for Data
The power behind this data-centered approach is that it takes a
time-domain problem (listening for heartbeats, polling for desired
state, reacting to changes) and turns it into a data problem: instead of
thinking in terms of responding to an app’s historical timeline it
becomes possible to think in terms of operating on data sets of
different configurations. Since keeping track of an app’s historical
timeline was the root of much of the complexity of the original HM, this
data-centered approach proved to be a great simplification for HM9000.
Take our earlier example of a desired app that isn’t running yet:
HM9000 has a simple mechanism for handling this problem. If the analyzer
sees that an app is desired but not running it places a decision in the
store to send a START command. Since the analyzer has no way of knowing
if the app will eventually start or not, this decision is marked to be
sent after a grace period. That is the extent of the analyzer’s
responsibilities.
When the grace period elapses, the sender attempts to send a START
command; but it first checks to ensure the command is still valid (that
the missing app is still desired and still missing). The sender does not need to know why the START decision was made, only whether or not it is currently valid. That is the extent of the sender’s responsibilities.
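Concretely, this splits into two tiny pieces of logic: the analyzer stamps a decision with a "send after" time, and the sender re-verifies the decision before acting on it. The sketch below is illustrative only; the `PendingStart` type, its fields, and the grace-period handling are assumptions, not HM9000's actual schema.

```go
package pending

import "time"

// PendingStart is an illustrative timestamped decision placed in the store.
type PendingStart struct {
	AppGUID   string
	Index     int
	SendAfter time.Time
}

// NewPendingStart is what the analyzer would create when it sees a desired
// app that is not running: the decision itself, not the START message.
func NewPendingStart(appGUID string, index int, grace time.Duration) PendingStart {
	return PendingStart{
		AppGUID:   appGUID,
		Index:     index,
		SendAfter: time.Now().Add(grace),
	}
}

// ShouldSend is the sender's side: the grace period must have elapsed AND the
// decision must still be valid (the app is still desired and still missing,
// as determined from the store). The sender never needs to know why the
// decision was made.
func (p PendingStart) ShouldSend(now time.Time, stillDesired, stillMissing bool) bool {
	if now.Before(p.SendAfter) {
		return false
	}
	return stillDesired && stillMissing
}
```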
This new mental model – of timestamped decisions that are verified by
the sender – makes the problem domain easier to reason about, and the
codebase cleaner. Moreover, it becomes much easier to unit and integration test the behavior of the system, as the correct state can be set up in the store and then evaluated.
There are several added benefits to a data-centered approach. First,
it makes debugging the mini-distributed system that is HM9000 much
easier: to understand what the state of the world is, all you need to do is dump the database and analyze its output. Second, requiring that
coordination be done via a data store allows us to build components that
have no in-memory knowledge of the world. If a given component fails,
another copy of the component can come up and take over its job —
everything it needs to do its work is already in the store.
This latter point makes it easy to ensure that HM9000 is not a single point of failure. We simply spin up two HM9000s (what amounts to 16
mini-components, 8 for each HM9000) across two availability zones. Each
component then vies for a lock in etcd and maintains the lock as it does
its work. Should the component fail, the lock is released and its
doppelgänger can pick up where it left off.
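A lock of this kind can be built on etcd's TTL'd keys and atomic compare-and-swap. The sketch below is a minimal illustration against a hypothetical `LockStore` interface rather than a specific etcd client (client APIs have changed across etcd versions), and is not HM9000's actual locking code.

```go
package ha

import (
	"log"
	"time"
)

// LockStore is a hypothetical abstraction over etcd's TTL'd keys and atomic
// compare-and-swap; the real client API differs across etcd versions.
type LockStore interface {
	// AcquireOrRefresh atomically creates or refreshes lockKey with the given
	// TTL if the lock is unowned or already owned by ownerID, returning true;
	// it returns false if another owner currently holds the lock.
	AcquireOrRefresh(lockKey, ownerID string, ttl time.Duration) (bool, error)
}

// MaintainLock blocks until the lock is acquired, starts the component's work,
// and then keeps the lock alive. If the lock is ever lost, the process exits so
// the doppelgänger in the other availability zone can take over; everything it
// needs is already in the store.
func MaintainLock(store LockStore, lockKey, ownerID string, ttl time.Duration, work func()) {
	for {
		ok, err := store.AcquireOrRefresh(lockKey, ownerID, ttl)
		if err != nil {
			log.Printf("lock error: %v", err)
		}
		if ok {
			break
		}
		time.Sleep(ttl / 2) // someone else holds the lock; keep waiting
	}

	go work()

	// Refresh well before the TTL expires so the lock never lapses while we
	// are still alive.
	for range time.Tick(ttl / 2) {
		ok, err := store.AcquireOrRefresh(lockKey, ownerID, ttl)
		if err != nil || !ok {
			log.Fatalf("lost the lock (held=%v, err=%v); exiting", ok, err)
		}
	}
}
```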
Why etcd?
We evaluated a number of persistent stores (etcd, ZooKeeper,
Cassandra). We found etcd’s API to be a good fit for HM9000’s problem space, and etcd’s Go driver had fewer issues than the Go drivers for the
other databases.
In terms of performance, etcd was close to ZooKeeper. Early versions of HM9000 were naive about the number of writes that etcd could handle, and we found it necessary to be judicious about reducing the write load. Since all our components
communicated with etcd via a single ORM-like library, adding these
optimizations after the fact proved to be easy.
We’ve run benchmarks against HM9000 in its current optimized form and
find that it can sustain a load of 15,000-30,000 apps. This is
sufficient for all Cloud Foundry installations that we know of, and we
are ready to shard across multiple etcd clusters if etcd proves to be a
bottleneck.
Why Go?
In closing, some final thoughts on the decision to switch to Golang.
There is nothing intrinsic to HM9000 that screams Go: there isn’t a lot
of concurrency, operating/start-up speed isn’t a huge issue, and there
isn’t a huge need for the sort of systems-level manipulation that the Go
standard library is so good at.
Nonetheless, we really enjoyed writing HM9000 in Go over Ruby. Ruby lacks a compile step – which gives the language an enviable flexibility but comes at the cost of limited static analysis. Go’s (blazingly fast) compile step alone is well worth the switch (large-scale refactors are a breeze!) and writing in Go is expressive, productive, and developer friendly. Some might complain about verbosity and the amount of boilerplate, but we believe that Go strikes this balance well. In particular, it’s hard to be as terse and obscure in Go as one can be in Ruby; and it’s much harder to metaprogram your way into trouble with Go ;)
We did, of course, experience a few problems with Go. The dependency-management issue is a well-known one: HM9000 solves this by vendoring dependencies with submodules. And some Go database drivers proved to be inadequate (we investigated using ZooKeeper and Cassandra in addition to etcd and had problems with Go driver support). Also, the lack of generics was challenging at first, but we quickly found patterns to work around it.
Finally, testing in Go was a joy. HM9000 became something of a proving ground for Ginkgo & Gomega, and we’re happy to say that both projects have benefited from the experience.
