---
date: 2015-08-27
title: All Backup Solutions Suck
category: opinions
---

Recently I have been working a lot on a backup solution at work, which has been a painful experience to say the least. Why? Simply because there is no solution that meets my ideal requirements. These are pretty precise:

- Open Source
- Openstack Swift as a backend
- File level restores
- Scalable
- Lightweight
- An idiot must be able to restore a file
- Not a source of truth about my infrastructure
- Automated restore testing

A nice bonus would be volume level backups of Openstack Cinder.

From what I can tell, nothing currently out there meets these requirements. If I take away the Open Source requirement, it does not get much better. [Rubrik](http://www.rubrik.com) looks interesting, but it is tied into VMware, as are a few other solutions.

Nothing meets my needs :-( Naturally this has got me thinking about how something like this could be achieved, so here goes.

I am actually taking my inspiration from the monitoring world, where there have been similar problems. In the past, one just went straight for Nagios to do alerting and Munin/Cacti for storing metrics. For various reasons these just sucked, but the big one for me was this:

> I had to tell it what it had to monitor!

Tomorrow, I may be called upon to deploy a Hadoop cluster with 100 slaves. All of these would have to be individually added to Nagios. This invariably got forgotten, and before long nothing was monitored and Nagios was forgotten about. Things broke, nobody knew about it. Everyone said "IT SUCKS".

However, recently I've been playing around with [Sensu](http://www.sensuapp.org). This works the other way round. The node announces itself to the server, which has a set of rules that the node uses to monitor itself. This, allied with all the comms being over a Message Queue, makes it astoundingly scalable.

This is the sort of principle that backup should follow.

You have a central server, which is essentially just an API that a node can query to discover what to do. This is based on rules such as cloud metadata, hostname, subnet, whatever else takes your fancy. As this server is just an API, we can use an HTTP load balancer and a NoSQL database to improve availability and scaling.

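To make that a bit more concrete, here is a minimal sketch of the sort of rule matching that API could do. All of the rule fields, the facts the agent reports and the resulting backup plans are made up for illustration; a real deployment would keep the rules in the NoSQL database behind the API.

```python
import fnmatch
import ipaddress

# Hypothetical rules held by the central server.
RULES = [
    {"match": {"hostname": "db-*"},        "backup": {"paths": ["/var/lib/mysql"], "schedule": "hourly"}},
    {"match": {"subnet": "10.1.0.0/16"},   "backup": {"paths": ["/etc", "/home"],  "schedule": "daily"}},
    {"match": {"metadata.role": "hadoop"}, "backup": {"paths": ["/etc/hadoop"],    "schedule": "daily"}},
]

def rule_matches(match, facts):
    """Return True if every condition in a rule matches the node's facts."""
    for key, wanted in match.items():
        if key == "hostname":
            if not fnmatch.fnmatch(facts.get("hostname", ""), wanted):
                return False
        elif key == "subnet":
            ip = ipaddress.ip_address(facts.get("ip", "0.0.0.0"))
            if ip not in ipaddress.ip_network(wanted):
                return False
        elif key.startswith("metadata."):
            if facts.get("metadata", {}).get(key.split(".", 1)[1]) != wanted:
                return False
        else:
            return False
    return True

def plan_for(facts):
    """What the API would hand back to an agent announcing itself with `facts`."""
    return [rule["backup"] for rule in RULES if rule_matches(rule["match"], facts)]

if __name__ == "__main__":
    # An agent announcing itself with its own facts, cloud-metadata style.
    print(plan_for({"hostname": "db-03", "ip": "10.1.4.7", "metadata": {"role": "db"}}))
```

Because the server only ever answers lookups like this, it stays stateless, which is exactly what lets it hide behind an HTTP load balancer.
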
A new agent comes online, after being installed by my CM system. It queries the API to find out what to do, and then takes it from there. The only time it will interact with the central server is when its configuration changes. It knows what to back up and where to put it, so off it goes. It can use existing tools: `tar`, `bzip2`, `duplicity`, `gpg` etc. and push it directly into the storage desired (S3/Glacier or Openstack Swift would be the best choices I'd say).

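Here is a rough sketch of what such an agent might do once it has its plan, leaning entirely on existing tools. The plan structure, the container name and the use of the `swift` CLI (with credentials assumed to come from the usual `OS_*` environment variables) are all assumptions for illustration, not a description of any existing product.

```python
import datetime
import socket
import subprocess
import tempfile
from pathlib import Path

def run_backup(plan):
    """Archive, encrypt and push one backup job using standard tools.

    `plan` is the hypothetical structure handed out by the rules API, e.g.
    {"paths": ["/etc"], "container": "backups", "gpg_recipient": "backup@example.com"}
    """
    stamp = datetime.datetime.utcnow().strftime("%Y%m%dT%H%M%SZ")
    name = f"{socket.gethostname()}-{stamp}.tar.bz2"

    with tempfile.TemporaryDirectory() as tmp:
        archive = Path(tmp) / name

        # Plain old tar + bzip2; nothing exotic needed at restore time.
        subprocess.run(["tar", "-cjf", str(archive), *plan["paths"]], check=True)

        # Encrypt with gpg so the storage backend never sees cleartext.
        encrypted = Path(str(archive) + ".gpg")
        subprocess.run(
            ["gpg", "--encrypt", "--recipient", plan["gpg_recipient"],
             "--output", str(encrypted), str(archive)],
            check=True,
        )

        # Push straight into Swift with the standard CLI.
        subprocess.run(
            ["swift", "upload", plan["container"], str(encrypted),
             "--object-name", encrypted.name],
            check=True,
        )

if __name__ == "__main__":
    run_backup({"paths": ["/etc"], "container": "backups",
                "gpg_recipient": "backup@example.com"})
```

Restoring is then just `swift download`, `gpg` and `tar` in the other direction, which is rather the point.
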
Of course, I am mindful that not all nodes will have direct access to the storage backend, for many reasons. In this case, it could use the Load Balancer already used for the server API to bounce off to the storage. After all, this is just an HTTP stream we are talking about; even a fairly lightweight HAProxy instance will be able to handle 100s of streams.

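The agent-side decision could be as dumb as the sketch below. Both endpoints are invented, and the assumption is simply that the load balancer proxies the Swift API on a well-known URL.

```python
import socket
from urllib.parse import urlparse

# Hypothetical endpoints: the real Swift proxy, and the HAProxy/load balancer
# that already fronts the central API and can relay storage traffic too.
DIRECT_SWIFT_URL = "https://swift.internal.example.com:8080/v1/AUTH_backups"
BOUNCE_URL = "https://backup-lb.example.com/v1/AUTH_backups"

def reachable(url, timeout=2.0):
    """Cheap reachability test: can we open a TCP connection to the endpoint?"""
    parsed = urlparse(url)
    port = parsed.port or (443 if parsed.scheme == "https" else 80)
    try:
        with socket.create_connection((parsed.hostname, port), timeout=timeout):
            return True
    except OSError:
        return False

def storage_endpoint():
    """Prefer talking to Swift directly; bounce off the load balancer otherwise."""
    return DIRECT_SWIFT_URL if reachable(DIRECT_SWIFT_URL) else BOUNCE_URL

if __name__ == "__main__":
    print("Uploading via", storage_endpoint())
```
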
So on paper, what this should give us is a backup solution that is:

- Scalable: as there is no need to define nodes on a central server, there is no extra step when configuring a new node. You deploy it, install the agent and it just works. Perhaps one could follow the Puppet model, where it defaults to a certain hostname as the server. If that is in your DNS, then you do not even need to configure the agent (see the sketch after this list).
- High performance: The processing is distributed across literally your entire infrastructure, so your backup server does not become a bottleneck.
- Has no single point of failure: If your server is just a REST API and a web app, then HA can be performed easily with well-understood techniques. Even if you do lose it, your backups do not stop, as the nodes are doing it all themselves.
- Restores use standard tools: If you have lost everything, there is no need to bring up your backup infrastructure first in order to access your data. It is stored on a standard backend, created with standard tools that are available on any node with just a simple `apt-get`/`yum`.

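To illustrate that Puppet-style default, server discovery on the agent could be as small as this; the config path and the `backup-api` default hostname are invented for the example.

```python
import configparser
from pathlib import Path

# Hypothetical config location; absent by default, much like an unconfigured
# Puppet agent falling back to the hostname "puppet".
CONFIG_FILE = Path("/etc/backup-agent/agent.conf")
DEFAULT_SERVER = "backup-api"  # resolved via your DNS search domain

def api_server():
    """Use an explicit server if configured, otherwise the conventional name."""
    if CONFIG_FILE.exists():
        cfg = configparser.ConfigParser()
        cfg.read(CONFIG_FILE)
        return cfg.get("agent", "server", fallback=DEFAULT_SERVER)
    return DEFAULT_SERVER

if __name__ == "__main__":
    print("Talking to", api_server())
```
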
All this seems obvious to me, so why has no-one done it?

Of course, a genuine backup product needs to do reporting and things like that. This is another role the central server could take on: it has a MongoDB cluster to store all that in. Or, that could be an "add-on" that just hooks into the same MongoDB (UNIX principle: do one job, and do it well).

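As a final hypothetical sketch, reporting could be nothing more than the agent (or a separate add-on service) writing a small document per backup run into that MongoDB. This assumes the `pymongo` driver and a reachable cluster; the database, collection and field names are all made up here.

```python
import datetime
import socket

from pymongo import MongoClient  # assumes the pymongo driver is installed

def record_backup_run(archive_name, size_bytes, ok,
                      mongo_uri="mongodb://backup-db.example.com:27017"):
    """Store one backup run's outcome so a reporting add-on can query it later."""
    client = MongoClient(mongo_uri)
    reports = client["backups"]["runs"]  # hypothetical database/collection names
    reports.insert_one({
        "host": socket.gethostname(),
        "archive": archive_name,
        "size_bytes": size_bytes,
        "success": ok,
        "finished_at": datetime.datetime.utcnow(),
    })

if __name__ == "__main__":
    record_backup_run("web-01-20150827T020000Z.tar.bz2.gpg", 104857600, True)
```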