
Backing up a database on Amazon S3

Backing up a database is necessary for every type of project. Whether you are dealing with financial transactions and need backups every 10 minutes, or you are running a hobby site that only needs a weekly backup, you still have to find a way to keep a copy of your database readily available. Depending on your database, there may be tools and services that do the job for you. Here are some examples:

  1. If you use Postgres and Amazon then you are probably using the Amazon Relational Database Service (RDS). This is a great service since it lets you create servers that provide Postgres without setting up the server yourself. In other words, instead of you spinning up a Linux EC2 instance and installing Postgres (and then maintaining it - or, as usually happens, not maintaining it), Amazon does the job for you. The service also provides a backup mechanism which automatically stores daily backups for up to a certain retention period (I think about a month) - see the sketch after this list. This is great if your needs are met by a single daily backup that spans a short period.
  2. If you are using MongoDB, then MongoDB MMS is an excellent service.
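
As a rough sketch of the RDS route, the retention period is just a setting on the instance and can be changed with the aws-cli (the instance identifier below is a made-up example):

# sketch: set automated backup retention on an RDS instance to 7 days
# (instance name is illustrative; RDS caps the retention period at about a month)
aws rds modify-db-instance \
    --db-instance-identifier my-postgres-instance \
    --backup-retention-period 7 \
    --apply-immediately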

OR...

You can just create your own backup mechanism with a cronjob, a small shell script and an AWS account. This is a generic solution which can be used with any type of database.

The shell script

You need to automate the backup process, so I put together a shell script that takes a single argument (the S3 folder - more on that later):

#!/bin/bash
# /home/auser/db_backup.sh
# timestamp so every backup file has a unique, traceable name
NOW=$(date +"%Y%m%d_%H%M%S")
# dump the database into a scratch directory and compress it
cd /home/auser/somefolderyourdontcareabout;/usr/bin/pg_dump -U postgres mydatabase | gzip > mydatabase.$NOW.sql.gz

# move the compressed dump to S3; $1 selects the destination folder
/usr/local/bin/aws s3 mv mydatabase.$NOW.sql.gz s3://my-bucket-on-s3/db-backup/$1/



Let's look at it line by line. First, create a timestamp:

NOW=$(date +"%Y%m%d_%H%M%S")

It is useful to know exactly when your backup was taken. When your client asks you why the 200 million dollar money transfer did not go through this evening, you will probably want to dissect a copy of your database from around that time (and not a stale version from a couple of days ago). The timestamp will be your lifesaver.
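
For reference, the format above produces compact, chronologically sortable timestamps:

date +"%Y%m%d_%H%M%S"
# prints something like 20160312_174501, i.e. YYYYMMDD_HHMMSS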



Create a backup file from your database:

cd /home/auser/somefolderyourdontcareabout;/usr/bin/pg_dump -U postgres mydatabase | gzip > mydatabase.$NOW.sql.gz

Next, move into a directory which you will use for temporarily storing the backup (a tmp-type dir) and run your backup command. In this case it is pg_dump, which works for Postgres. Tailor that to your database type (e.g. mongodump for MongoDB). Notice that I am piping the output to gzip. This is an optional step: in my case I am backing up a database which is about 1GB, so I want to save as much space as possible.
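
If you were backing up MongoDB instead of Postgres, the equivalent line might look something like this (the database name and binary path are assumptions, and it relies on a mongodump recent enough to support --archive and --gzip):

# sketch: MongoDB equivalent of the pg_dump line above (names are illustrative)
cd /home/auser/somefolderyourdontcareabout;/usr/bin/mongodump --db mydatabase --archive --gzip > mydatabase.$NOW.archive.gz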



Finally, transfer the file to S3:

/usr/local/bin/aws s3 mv mydatabase.$NOW.sql.gz s3://my-bucket-on-s3/db-backup/$1/

Notice that I am using the full path to the aws executable (see more on the aws-cli). Keep in mind that the script may run via a cronjob, in which case it is not guaranteed that the PATH will be set. Also, there is a variable $1 at the end of the S3 path. Again, this is optional; it is there in case you want to send the backup to different folders. The cronjobs below will clarify this further.
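
Before wiring the script into cron, it is worth running it once by hand to check that both the dump and the upload work:

# one-off manual run; the argument becomes the destination folder on S3
/bin/sh /home/auser/db_backup.sh hourly
# the dump should now be at s3://my-bucket-on-s3/db-backup/hourly/mydatabase.<timestamp>.sql.gz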

Cronjob

At the bottom of my crontab file I have:

AWS_CONFIG_FILE="/home/auser/.aws/config"

0 * * * * /bin/sh /home/auser/db_backup.sh hourly
0 0 * * * /bin/sh /home/auser/db_backup.sh daily
0 0 * * 6 /bin/sh /home/auser/db_backup.sh weekly

The script runs every:

  1. hour and sends the backup to the hourly folder in S3
  2. day and sends the backup to the daily folder in S3
  3. week and sends the backup to the weekly folder in S3

This way I have a very flexible mechanism for keeping backups of my databases at regular intervals. Notice that above the cronjobs I have set the AWS_CONFIG_FILE environment variable. In order to run the script you need credentials for the AWS API. These are read from the file that AWS_CONFIG_FILE points to, and the variable needs to be set in the crontab file. Your usual environment is not available when cronjobs run, so if you need environment variables, make sure that you declare them inside the crontab file.
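
For reference, the file that AWS_CONFIG_FILE points to is the standard aws-cli configuration file. A minimal version looks roughly like this (the keys and region below are placeholders):

[default]
aws_access_key_id = YOUR_ACCESS_KEY_ID
aws_secret_access_key = YOUR_SECRET_ACCESS_KEY
region = us-east-1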

Housekeeping

Now you have your backups, and very soon you will have many more than you actually need. For example, the hourly backups may be useful for anything that happened over the last week, but they have no value for data that is a couple of months old. Also, accumulating data is not free. S3 is pretty cheap, but eventually the bill will inflate unnecessarily. S3 does come with mechanisms for deleting files after a certain period, or moving them to Glacier for 'deep storage', which is cheaper. You can find more information on S3 lifecycle management here.

In one of my setups I have implemented the following rules:

  1. items in the hourly folder are moved to Glacier after 2 months
  2. items in the daily folder are moved to Glacier after 1 year
  3. items in the weekly folder are kept in S3 storage for ever (... and ever)
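
These rules can be clicked together in the S3 console, but they can also be scripted. Here is a rough sketch of the first rule using the aws-cli (the bucket name matches the script above; the exact JSON layout may need adjusting for your aws-cli version):

# sketch: move everything under db-backup/hourly/ to Glacier after roughly 2 months
cat > lifecycle.json <<'EOF'
{
  "Rules": [
    {
      "ID": "hourly-to-glacier",
      "Filter": { "Prefix": "db-backup/hourly/" },
      "Status": "Enabled",
      "Transitions": [ { "Days": 60, "StorageClass": "GLACIER" } ]
    }
  ]
}
EOF

/usr/local/bin/aws s3api put-bucket-lifecycle-configuration \
    --bucket my-bucket-on-s3 \
    --lifecycle-configuration file://lifecycle.json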

It is worth knowing that S3 also supports versioning. There is probably a better way of backing up than the one I am suggesting, one which makes the most of versioning. If anyone has already done it, please let me know!
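
If you want to experiment with that approach, switching versioning on for the bucket is a one-liner (same bucket name as above):

# sketch: enable versioning on the backup bucket
/usr/local/bin/aws s3api put-bucket-versioning \
    --bucket my-bucket-on-s3 \
    --versioning-configuration Status=Enabled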


