---
date: 2015-08-27
title: All Backup Solutions Suck
category: opinions
---

Recently I have been working a lot on a backup solution at work, which
has been a painful experience to say the least. Why? Simply because
there is no solution that meets my ideal requirements. These are pretty
precise:

- Open Source
- Openstack Swift as a backend
- File level restores
- Scalable
- Lightweight
- An idiot must be able to restore a file
- Not a source of truth about my infrastructure
- Automated restore testing

A nice bonus would be volume level backups of Openstack Cinder.

From what I can tell, nothing currently out there meets these
requirements. If I take away the Open Source requirement it does not get
much better. [Rubrik](http://www.rubrik.com) looks interesting, if only
it weren't tied into VMware, as a few other solutions are.

Nothing meets my needs :-( Naturally this has got me thinking about how
something like this could be achieved, so here goes.

I am actually taking my inspiration from the monitoring world, where
there have been similar problems. In the past, one just went straight
for Nagios to do alerting and Munin/Cacti for storing metrics. For
various reasons these just sucked, but the big one for me was this:

> I had to tell it what it had to monitor!

Tomorrow, I may be called upon to deploy a Hadoop cluster with 100
slaves. All of these would have to be individually added to Nagios. This
invariably got forgotten and before long nothing was monitored and
Nagios was forgotten about. Things broke, nobody knew about it. Everyone
said "IT SUCKS".

However, recently I've been playing around with
[Sensu](http://www.sensuapp.org). This works the other way round. The
node announces itself to the server, which has a set of rules that the
node uses to monitor itself. This, allied with all the comms being over
a Message Queue, makes it astoundingly scalable.

This is the sort of principle that backup should follow.

You have a central server, which is essentially just an API that a node
can query to discover what to do. This is based on rules such as cloud
metadata, hostname, subnet, whatever else takes your fancy. As this
server is just an API, we can use an HTTP load balancer and a NoSQL
database to improve availability and scaling.

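To make that concrete, here is a minimal sketch of what such a rules API
might look like. Everything in it is invented for illustration: the
`/v1/policy` endpoint, the rule format and the example policies. A real
server would keep the rules in the NoSQL database rather than in code.

```python
import fnmatch

from flask import Flask, jsonify, request

app = Flask(__name__)

# Hypothetical rules: the first rule whose patterns all match the node's
# facts wins. In a real deployment these would live in the database.
RULES = [
    {
        "match": {"hostname": "db-*"},
        "policy": {
            "paths": ["/var/lib/mysql"],
            "schedule": "0 2 * * *",
            "target": "swift://backups/databases",
        },
    },
    {
        "match": {"subnet": "10.0.*"},
        "policy": {
            "paths": ["/etc", "/home"],
            "schedule": "0 4 * * *",
            "target": "swift://backups/general",
        },
    },
]

@app.route("/v1/policy", methods=["POST"])
def policy():
    # The node POSTs its facts (hostname, subnet, cloud metadata...)
    # and gets back the first backup policy whose patterns all match.
    facts = request.get_json() or {}
    for rule in RULES:
        if all(fnmatch.fnmatch(str(facts.get(key, "")), pattern)
               for key, pattern in rule["match"].items()):
            return jsonify(rule["policy"])
    return jsonify({"error": "no matching policy"}), 404
```

Because the server holds no per-node state, any replica behind the load
balancer can answer, which is what keeps the HA story simple.
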
A new agent comes online, after being installed by my CM system. It
queries the API to find out what to do, then takes it from there. The
only time it will interact with the central server is when its
configuration changes. It knows what to back up and where to put it, so
off it goes. It can use existing tools: `tar`, `bzip2`, `duplicity`,
`gpg` etc. and push it directly into the desired storage (S3/Glacier or
Openstack Swift would be the best choices I'd say).

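And the agent side, under the same invented endpoint and policy format,
could start out as little more than this. The tar/gpg pipeline stands in
for whichever standard tools the policy calls for, and the final upload
is only hinted at:

```python
import socket
import subprocess

import requests

# Hypothetical endpoint; in the Puppet-style model this hostname would
# simply be a DNS convention.
API = "http://backup.example.com/v1/policy"

# Announce our facts and fetch whatever policy the server's rules produce.
facts = {"hostname": socket.gethostname()}
policy = requests.post(API, json=facts, timeout=10).json()

for path in policy["paths"]:
    archive = "/tmp/" + path.strip("/").replace("/", "_") + ".tar.bz2.gpg"
    # Standard tools all the way down: tar + bzip2, encrypted with gpg.
    tar = subprocess.Popen(["tar", "-cjf", "-", path],
                           stdout=subprocess.PIPE)
    with open(archive, "wb") as out:
        subprocess.run(
            ["gpg", "--encrypt", "--recipient", "backup@example.com"],
            stdin=tar.stdout, stdout=out, check=True,
        )
    tar.stdout.close()
    tar.wait()
    # A PUT of `archive` to policy["target"] would follow; `duplicity`
    # could replace this whole pipeline and do the push itself.
```
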
Of course, I am mindful that not all nodes will have direct access to
the storage backend for many reasons. In this case, a node could use the
Load Balancer already used for the server API to bounce off to the
storage. After all, this is just an HTTP stream we are talking about;
even a fairly lightweight HAProxy instance will be able to handle 100s
of streams.

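To sketch that too, with the same hypothetical names: the agent's upload
helper only needs to know two base URLs, because to HAProxy a Swift
`PUT` is simply another HTTP stream:

```python
import requests

# Both URLs are made up for illustration: one direct route to Swift, one
# through the existing API load balancer for nodes without that route.
SWIFT_DIRECT = "https://swift.example.com/v1/AUTH_backups"
SWIFT_VIA_LB = "https://backup.example.com/storage/v1/AUTH_backups"

def upload(archive_path, container, name, token, via_lb=False):
    """PUT one archive into Swift, optionally bouncing off the LB."""
    base = SWIFT_VIA_LB if via_lb else SWIFT_DIRECT
    with open(archive_path, "rb") as f:
        resp = requests.put(
            f"{base}/{container}/{name}",
            data=f,  # streamed from disk, not buffered in memory
            headers={"X-Auth-Token": token},
            timeout=300,
        )
    resp.raise_for_status()
```
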
So on paper, what this should give us is a backup solution that is:

- Scalable: as there is no need to define nodes on a central server,
  there is no extra step when configuring a new node. You deploy it,
  install the agent and it just works. Perhaps one could follow the
  Puppet model, where the agent defaults to a certain hostname as the
  server. If that is in your DNS, then you do not even need to configure
  the agent.
- High performance: the processing is distributed across literally your
  entire infrastructure, so your backup server does not become a
  bottleneck.
- Free of single points of failure: if your server is just a REST API
  and a web app, then HA can be performed easily with well understood
  techniques. Even if you do lose it, your backups do not stop, as the
  nodes are doing it all themselves.
- Restorable with standard tools: if you have lost everything, there is
  no need to bring up your backup infrastructure first in order to
  access your data. It is stored on a standard backend, created with
  standard tools that are available on any node with just a simple
  `apt-get`/`yum`.

All this seems obvious to me, so why has no-one done it?

Of course, a genuine backup product needs to do reporting and things
like that. This is another role the central server could take on: it has
a MongoDB cluster to store all that in. Or, that could be an "add-on"
that just hooks into the same MongoDB (UNIX principle: do one job, and
do it well).

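To round off the sketch: if agents report each run back through the API,
the reporting side reduces to writing one small document per backup into
that MongoDB cluster. The collection layout and field names here are,
again, made up:

```python
from datetime import datetime, timezone

from pymongo import MongoClient

# Hypothetical cluster address and collection layout.
reports = MongoClient("mongodb://backup-db.example.com")["backup"]["reports"]

def record_run(hostname, path, target, ok, size_bytes):
    """Store the outcome of one backup run for later reporting."""
    reports.insert_one({
        "hostname": hostname,
        "path": path,
        "target": target,
        "ok": ok,
        "size_bytes": size_bytes,
        "finished_at": datetime.now(timezone.utc),
    })

# e.g. record_run("db-01", "/var/lib/mysql",
#                 "swift://backups/databases", True, 52428800)
```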