Big Data Backup to S3 Glacier via Java SDK

Goal: Use the Amazon S3 and Glacier Java SDKs (v2) to implement automatic daily big data backups – a 200GB SQL database, growing by 72GB a year – to either S3 or Glacier cold cloud storage, minimize spend on data archival, and add peace of mind.

Why Glacier (or Glacier storage class)? It’s exceedingly inexpensive to archive data for disaster recovery in Glacier. Glacier storage is only US$0.004 per GB/mo, and its SDK is beautiful. Ordinary S3 alone is relatively expensive for the use case, Google Drive and Dropbox are overpriced for big-data archival, and the others are unknowns.

Let’s compare apples to apples on cost first.

Expected Cost

Monthly    100GB      200GB      500GB      1TB        2TB        10TB
Google     C$2.79     C$4        NA         NA         C$14       C$140
Dropbox    C$13, C$28
Degoo      free, US$3, US$10
CrashPlan  US$10      US$10      US$10      US$10      US$10¹     US$10¹
AWS S3     US$2.30    US$4.60    US$11.50   US$23      US$46      US$230
Glacier    US$0.40    US$0.80    US$2       US$4       US$8       US$40

Yearly     100GB      200GB      500GB      1TB        2TB        10TB
Google     C$28       C$40       NA         NA         C$140      C$1,680
Dropbox    C$129, C$279
Degoo      free, US$36, US$120
CrashPlan  US$120     US$120     US$120     US$120     US$120¹    US$120¹
AWS S3     US$27.60   US$55.20   US$138     US$276     US$552     US$2,760
Glacier    US$4.80    US$9.60    US$24      US$48      US$96      US$480

The pricing sweet spot for big data storage is between 200GB and 2TB with Glacier, plus Amazon has an easy-to-use Java SDK and Maven dependencies already available. I added Degoo and CrashPlan because of their interesting pricing models.

AWS Glacier is the clear winner: it meets the goal of big data backup at under one US dollar per month for 200GB.
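
To make the comparison concrete, here is a minimal sketch of the storage-only arithmetic behind the S3 and Glacier rows. The per-GB rates are the ones implied by the table (US$0.023 for S3, US$0.004 for Glacier), and the class and method names are illustrative only. It assumes linear growth of 72GB a year and ignores request, retrieval, and transfer fees.

```java
import java.util.Locale;

/** Rough storage-only cost comparison using the per-GB rates implied by the table. */
final class StorageCost {
    // Assumed list prices in US$ per GB per month (storage only).
    static final double S3_STANDARD_PER_GB = 0.023;
    static final double GLACIER_PER_GB = 0.004;

    /** Monthly storage cost in US$ for the given size and rate. */
    static double monthly(double gigabytes, double ratePerGb) {
        return gigabytes * ratePerGb;
    }

    /** Size in GB after the given number of years of linear growth. */
    static double projectedSize(double startGb, double growthGbPerYear, double years) {
        return startGb + growthGbPerYear * years;
    }

    public static void main(String[] args) {
        // Start at 200GB and grow by 72GB a year, as stated in the goal.
        for (int year = 0; year <= 5; year++) {
            double gb = projectedSize(200, 72, year);
            System.out.printf(Locale.ROOT, "year %d: %.0f GB, S3 US$%.2f/mo, Glacier US$%.2f/mo%n",
                    year, gb, monthly(gb, S3_STANDARD_PER_GB), monthly(gb, GLACIER_PER_GB));
        }
    }
}
```

Even after five years of growth (560GB), the Glacier figure stays in the low single digits per month under these assumptions.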

A First Look at the AWS Glacier Java SDKs

AWS Glacier Java SDK examples

Before settling on the AWS ecosystem, first I experimented with the ListVaults example. It was quick to set up a custom AWS policy for Glacier uploads, assign this policy to a new user who will have a Glacier-only scope, allow programmatic interaction (to get the secret keys for API interaction), and create a Vault.

Set up Maven Dependencies

Setting up the Maven SDK BOM (bill of materials) dependency manager and the Glacier plugin was quick. However, the overwhelming majority of sample code is written for version 1 of the SDK, so I chose to include both the v1 and v2 SDKs side by side. Here are the API differences between the two versions.

List Glacier Vaults

My Java SDK v2 test code to get the vault descriptors is below.

Create a Glacier Vault

Next, I try to create a vault with the SDKv2.

This short snippet creates a new vault perfectly. In fact, repeated calls to create the same vault raise no exceptions.

Upload a Test Archive to Glacier

Next, I’d like to use the high-level upload manager library (v1 only) to test uploading to Glacier. When this is available for v2 I may come back and refactor the code.

Uploading worked for a 44MB test file, albeit slowly, taking about 1.2 minutes with an upload speed of about 5Mbps – that’s 0.6MB/s.

Slow upload speed to Glacier

Is something slow with Glacier? I modified the example code several times, but it always resulted in an upload speed of 5Mbps. That led me to run a speed test, which also showed 5Mbps. Not having audited my internet needs for a while, I had failed to notice that my internet plan has a very slow upload speed.
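
The conversion between the observed transfer and the advertised line speed is easy to get wrong by a factor of eight, so here is a small sketch of it. It takes the 44MB file and the roughly 1.2 minutes (72 seconds) quoted above; the class and method names are illustrative.

```java
/** Derives the effective upload speed from a file size and an elapsed time. */
final class Throughput {
    /** Megabytes per second actually achieved. */
    static double megabytesPerSecond(double sizeMb, double elapsedSeconds) {
        return sizeMb / elapsedSeconds;
    }

    /** The same rate in megabits per second (8 bits per byte), the unit ISPs advertise. */
    static double megabitsPerSecond(double sizeMb, double elapsedSeconds) {
        return megabytesPerSecond(sizeMb, elapsedSeconds) * 8.0;
    }

    public static void main(String[] args) {
        double sizeMb = 44;     // test archive size
        double seconds = 72;    // about 1.2 minutes
        System.out.printf("%.2f MB/s = %.2f Mbps%n",
                megabytesPerSecond(sizeMb, seconds), megabitsPerSecond(sizeMb, seconds));
    }
}
```

The result lands just under 5Mbps, which is consistent with the speed test rather than with any throttling on the Glacier side.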

Deal Breaker: List Glacier Vault Contents

Here is a wrinkle I encountered. It takes from a few hours to half a day to inventory each vault. Until the inventory completes, I can verify neither that the archives were uploaded nor their checksums. After creating a job and waiting hours for that inventory job to complete, I could get back a stream that resolved into a JSON response. Here is a sample response:

I noticed that even though I uploaded the same file multiple times, the data is treated as an object with a description, a timestamp, and a unique ID. There is no file name associated with the object. That means I had better include the original filename in the description. Also, files are not overwritten. This is definitely different from uploading to S3 or casual-user cloud storage. This storage workflow is a deal-breaker.

Note: Anything uploaded to Glacier is billed for a minimum of 90 days even if it is deleted before then.

Next Idea: AWS S3 Glacier Storage Class and Lifecycle Policies

Searching for how others solved this problem, I discovered that some people upload their archives to expensive AWS S3 but store them in the Glacier storage class. So-called lifecycle policies can do this automatically. The uploads are Glacier-priced but are accessed via the S3 API, and bucket (not vault) inventory is instantaneous. However, there is no notion of vaults and vault locks with the S3 API, and retrieval times are just as long.

Remember: Amazon Glacier != S3 Glacier storage class
Amazon S3 storage classes (source)

Here is an example of setting the bucket lifecycle policy to transition S3 objects immediately to the Glacier storage class (set days to 0). Interestingly, through S3 only (not Glacier), there is a Deep Glacier storage class (officially S3 Glacier Deep Archive) that has a minimum billing time of 6 months but is even cheaper than Glacier at US$0.00099 per GB/mo. Since Glacier requires a minimum billing time of 90 days, the objects will transition from the Glacier class to the Deep Glacier class after 90 days, and there will be no S3 billing.
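
The minimum billing periods matter as much as the per-GB rates, so here is a rough sketch of how they interact. It assumes storage is billed for at least the minimum period even if the object is deleted sooner, treats a month as the billing unit, and ignores request, transition, and retrieval fees. The class name and the 200GB figure (from the goal above) are illustrative.

```java
/** Storage charge for one object when a minimum billing period applies. */
final class MinimumBilling {
    /**
     * US$ billed for an object kept for monthsKept months in a class that
     * charges for at least minMonths months of storage.
     */
    static double charge(double gigabytes, double ratePerGbMonth, double minMonths, double monthsKept) {
        return gigabytes * ratePerGbMonth * Math.max(minMonths, monthsKept);
    }

    public static void main(String[] args) {
        double gb = 200;        // size of the data set
        double monthsKept = 1;  // deleted after one month

        // Glacier class: US$0.004 per GB/mo, 90-day (3-month) minimum.
        System.out.printf("Glacier:      US$%.3f%n", charge(gb, 0.004, 3, monthsKept));
        // Deep Glacier class: US$0.00099 per GB/mo, 6-month minimum.
        System.out.printf("Deep Glacier: US$%.3f%n", charge(gb, 0.00099, 6, monthsKept));
    }
}
```

Under these assumptions the longer six-month minimum still costs less than Glacier's three-month minimum, because the rate is roughly a quarter of Glacier's.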

S3 lifecycle policy to transition to Glacier

Having experienced the limitations of a pure-Glacier implementation, S3 with the Glacier and Deep Glacier storage classes looks to be a better alternative, which I will explore next.

Update the Maven Dependencies

I’ll remove SDK v1 dependencies to get used to SDK v2.

Upload a Test Archive to S3

After manually setting up a bucket with permissions, encryption, and more whiz-bang settings, my first test in code was to upload a file to S3. There are no folders in S3 as it is like a giant key:value data store, so I abstracted the concept of folders to stay consistent with my workflow. The simplest upload code follows.

Sure enough, my file appears in the AWS console. This is much more convenient than uploading directly to Glacier.

AWS S3 dashboard after API upload

However, the object (file) didn’t immediately transition to Glacier storage. I tried again without specifying the storage class parameter. Again, nothing. Finally, I tried again and specified StorageClass.GLACIER which did work. I need to explore why the lifecycle policies didn’t kick in.

Manually saving to Glacier storage class in S3

Solved: The lifecycle policy is working. There is a delay before the STANDARD storage class becomes GLACIER. Also, because I keep uploading the same test file, it kicks the file out of the lifecycle pipeline as if waiting for me to make up my mind.

Paranoid Data-Integrity Checks

AWS lets you calculate the MD5 digest locally and send it along with the upload; S3 then calculates its own MD5 digest of the received object and compares the two. If they differ, the upload fails. I added this bit of paranoid defensive code below.
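
As an illustration of the local half of that check (not the original code from this post), here is a JDK-only sketch that produces the Base64-encoded MD5 digest in the form the Content-MD5 header expects. The class and method names are made up for this example. In SDK v2 the resulting string would be handed to the request builder, which to my knowledge is `PutObjectRequest.builder().contentMD5(...)`; that call is not shown here so the sketch stays free of SDK dependencies.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Base64;

/** Computes Base64-encoded MD5 digests suitable for a Content-MD5 header. */
final class ContentMd5 {
    /** Digest of an in-memory byte array. */
    static String ofBytes(byte[] data) {
        return Base64.getEncoder().encodeToString(newMd5().digest(data));
    }

    /** Streams a file through MD5 so a large archive is never held in memory. */
    static String ofFile(Path file) {
        MessageDigest md5 = newMd5();
        byte[] buffer = new byte[1 << 16];
        try (InputStream in = Files.newInputStream(file)) {
            int read;
            while ((read = in.read(buffer)) != -1) {
                md5.update(buffer, 0, read);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return Base64.getEncoder().encodeToString(md5.digest());
    }

    private static MessageDigest newMd5() {
        try {
            return MessageDigest.getInstance("MD5");
        } catch (NoSuchAlgorithmException e) {
            // MD5 is one of the algorithms every Java platform must provide.
            throw new IllegalStateException("MD5 unavailable", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(ofBytes("hello".getBytes(StandardCharsets.UTF_8)));
    }
}
```

Note that the header carries the Base64 form of the raw 16-byte digest, not the familiar 32-character hex string.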

Bottom Line on S3 with Glacier Storage Class

S3 with lifecycle policies to move objects to the Glacier storage class is the best solution for my use case. I can overwrite archives with the same name, inventory the S3 bucket immediately after upload, and deeply manage permission siloing. Using S3 without a deliberate workflow is not a panacea, however. I’m going to explore the cost of many PUT requests to S3 – they can add up quickly – and how to reduce this cost.


Strategies for Continually Archiving Big Data

I’m at a crossroads. With a 5Mbps upload speed, it will take 22.8 hours to upload a 50GB compressed archive of my financial time-series data to S3+Glacier periodically. I could make this a scheduled weekly job on Sunday (all-day Sunday), or I could explore faster internet providers. Unfortunately, the telcos in British Columbia, Canada all have similarly slow upload speeds, possibly in an effort to nudge power users to business plans. Fiber-optic coverage is spotty in Metro Vancouver; otherwise it would be perfect, as the symmetric speeds of fiber are ideal. Let’s assume I’m limited to 5Mbps for now (recall that on-prem hardware is far less expensive than cloud processing).

1. Backup Atomically

One option is to run a start-to-finish S3+Glacier upload job every Sunday taking 23 hours to upload a 50GB-and-growing compressed archive. With S3, the backup can be uploaded in parts. If the upload is interrupted, it can be resumed as long as it is within 24 hours. This provides 24 hours for a 23-hour job at 5Mbps, weekly.

Pros: Coding is straightforward. Cons: The operation takes 23 hours, but over time will increase. Failure anywhere results in a total loss of the whole archive. Taking longer than 24 hours to upload results in a total loss as well.
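
How close this strategy runs to the 24-hour limit can be sketched with a few lines of arithmetic. The sketch assumes 1GB = 1024MB and a sustained 5Mbps with no protocol overhead; the class and method names are illustrative.

```java
/** Upload-time arithmetic for a fixed upload window. */
final class UploadWindow {
    /** Hours to upload sizeGb gigabytes (1GB taken as 1024MB) at the given speed. */
    static double hours(double sizeGb, double megabitsPerSecond) {
        double megabytes = sizeGb * 1024;
        double megabytesPerSecond = megabitsPerSecond / 8.0;
        return megabytes / megabytesPerSecond / 3600.0;
    }

    /** Largest archive, in GB, that still fits inside the window at the given speed. */
    static double maxGigabytes(double windowHours, double megabitsPerSecond) {
        return windowHours * 3600.0 * (megabitsPerSecond / 8.0) / 1024;
    }

    public static void main(String[] args) {
        System.out.printf("50 GB at 5 Mbps: %.1f hours%n", hours(50, 5));
        System.out.printf("24-hour ceiling at 5 Mbps: %.1f GB%n", maxGigabytes(24, 5));
    }
}
```

At 5Mbps the ceiling works out to a little under 53GB, so a 50GB-and-growing archive leaves very little headroom before the atomic approach stops fitting in a day.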

2. Backup Individually

Before any ETL, all the big data is stored in 55,000+ SQLite files totaling nearly 200GB. Each one can be compressed individually and uploaded to S3+Glacier. Over the course of the week, the previous week’s archives can be uploaded steadily. This provides 168 hours for a 23-hour job at 5Mbps, weekly.

It’s minute, but I discovered that Amazon reserves 32KB for metadata per object within Glacier, and 8KB per object in S3, both of which are charged back to the user. For this workflow, that is about 55,000 files at 8KB each or 430MB of overhead (US$0.002/month). However, what is not minute is that this strategy requires 55,000 x 4 (weekly) PUT requests, or 220,000 PUT requests a month (US$1.10/month). If you’re keeping track, this doubles our monthly storage cost!
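
The figures above can be reproduced with a short sketch. The PUT price of US$0.005 per 1,000 requests is the rate implied by the US$1.10 figure, the storage rate is the Glacier US$0.004 per GB/mo, and the names are illustrative.

```java
/** Per-object overhead and request costs for the many-small-files strategy. */
final class RequestOverhead {
    static final double PUT_PRICE_PER_1000 = 0.005;    // US$, implied by the figures above
    static final double STORAGE_PER_GB_MONTH = 0.004;  // US$, Glacier storage rate
    static final int METADATA_KB_PER_OBJECT = 8;       // S3 overhead per object

    /** US$ for the given number of PUT requests. */
    static double putCost(long requests) {
        return requests / 1000.0 * PUT_PRICE_PER_1000;
    }

    /** Metadata overhead in MB for the given number of objects. */
    static double metadataMegabytes(long objects) {
        return objects * METADATA_KB_PER_OBJECT / 1024.0;
    }

    /** Monthly US$ charged for storing that metadata overhead. */
    static double metadataCost(long objects) {
        return metadataMegabytes(objects) / 1024.0 * STORAGE_PER_GB_MONTH;
    }

    public static void main(String[] args) {
        long files = 55_000;
        long monthlyPuts = files * 4;  // one upload per file per week
        System.out.printf("metadata: %.0f MB, US$%.4f/month%n",
                metadataMegabytes(files), metadataCost(files));
        System.out.printf("PUT requests: %d, US$%.2f/month%n", monthlyPuts, putCost(monthlyPuts));
    }
}
```

The request charge dwarfs the metadata charge, which is why the number of objects, not their size, drives the extra cost here.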

Pros: An entire week is available for uploading. Cons: A disaster during the week leaves the archive incomplete. More upload management and tracking is required. This solution doubles cost or more.

3. Backup Deltas

Every week a rolling copy of the most recent one-week data is updated. Uncompressed this is about 5GB (2.3 hours to upload at 5Mbps). Compressed this is about 1GB (28 minutes at 5Mbps). This strategy provides 168 hours for a 28-minute job, weekly.

Pros: Weekly uploads are extremely quick. Coding is straightforward. Cons: Disaster recovery requires a complete initial archive plus the chain of deltas to assemble the complete data again. This doesn’t address chaining, or periodically uploading a complete 50GB+ archive.

4. Backup Chaining with Archive Volumes

Every weekend a 50GB+ full archive is created (taking 5 hours). Every weekday a rolling 1GB+ delta archive is created (taking 2 hours). These archives are split into parts like Zip volumes. These parts are then uploaded in the background over the course of the week, taking about half an hour for the dailies, and 23 hours for the full archive.

In the screenshot below I first uploaded the entire dailies, then I switched to uploading 512MB parts as a proof of concept.

S3 backup in parts example

Pros: Daily delta uploads are extremely quick. Full archives are generated and uploaded weekly. A total RAID disaster will lose at most one day of data. Cons: Full archives are in dozens of parts (volumes). Coding is not straightforward. Ninety days’ worth of redundant archives (S3+Glacier minimum billing) must be maintained (~660GB, or about US$2.64/month).
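
Splitting an archive into fixed-size volumes needs nothing beyond the JDK. The following is a minimal sketch rather than the code used for this post: the `ArchiveSplitter` name and the `.001`, `.002` part-naming scheme are assumptions, and the 512MB part size in `main` is taken from the proof of concept above.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/** Splits one large archive into numbered parts of at most partSize bytes. */
final class ArchiveSplitter {
    static List<Path> split(Path source, Path outDir, long partSize) {
        if (partSize <= 0) {
            throw new IllegalArgumentException("partSize must be positive");
        }
        List<Path> parts = new ArrayList<>();
        try (FileChannel in = FileChannel.open(source, StandardOpenOption.READ)) {
            Files.createDirectories(outDir);
            long size = in.size();
            long position = 0;
            while (position < size) {
                long length = Math.min(partSize, size - position);
                // Hypothetical naming scheme: archive.tar.001, archive.tar.002, ...
                Path part = outDir.resolve(String.format(Locale.ROOT, "%s.%03d",
                        source.getFileName(), parts.size() + 1));
                try (FileChannel out = FileChannel.open(part, StandardOpenOption.CREATE,
                        StandardOpenOption.TRUNCATE_EXISTING, StandardOpenOption.WRITE)) {
                    long copied = 0;
                    while (copied < length) {
                        // transferTo may move fewer bytes than requested, so loop until done.
                        copied += in.transferTo(position + copied, length - copied, out);
                    }
                }
                position += length;
                parts.add(part);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return parts;
    }

    public static void main(String[] args) {
        // Usage: java ArchiveSplitter.java <archive> <output-dir>
        List<Path> parts = split(Paths.get(args[0]), Paths.get(args[1]), 512L * 1024 * 1024);
        System.out.println(parts.size() + " parts written");
    }
}
```

Concatenating the parts in name order restores the original archive byte for byte, so no special tooling is needed at recovery time.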

Backup Strategy in Practice

I’ve decided to do what I do with my desktop hard drive backups: create a full backup, create a chain of deltas, create another full backup, and repeat.

First, a full archive is created each weekend. Below is a console snapshot showing about 57,000 SQLite database files being compressed to zip files. An x means the DB file hasn’t changed and doesn’t need re-compression. This takes about 5 hours to compress 200GB+ to around 50GB cumulative. These zipped files are then stored in a single archive (like tar) with a common name so the hard disk doesn’t fill up. This single archive is then rsync’d to another machine on the LAN at gigabit speeds, but not to S3, yet.

Full individual database compression (with faux highlighting)

Similarly, an archive of daily deltas (with a few days of overlap) is created, occupying about 1GB. This happens daily in the evening. The overlap exists so that if a day is missed or lost, for whatever reason, the subsequent delta still includes the missing day.

Full archive and delta (shadow) archive

You’d think that because the shadow archive is 1/50th the size of the full archive but represents the most recent four days, there are 200 days in the full archive (50 x 4). There are many more than this. The reason is that the SQLite file format has overhead that cannot be compressed further.

Next, the single 50GB archive is split into about one hundred parts of approximately 512MB each. Each part is uploaded to S3 by a background worker, so after about 23 hours at 5Mbps the entire archive is transferred. After each part is uploaded, it is removed from the local machine. If there is an interruption, the upload queue continues from the last pending part. This strategy incurs a cost on the order of only 100 PUT requests.
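
One simple way to get this resume behaviour is to let the parts directory itself act as the queue: whatever is still on disk is still pending. The sketch below illustrates that idea and is not the code behind this post. `Uploader` is a hypothetical hook standing in for the real S3 call, and the class name is made up.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

/** Uploads archive parts in order and deletes each one once it has been sent. */
final class PartUploadQueue {
    /** Hypothetical hook: a real implementation would PUT the part to S3 here. */
    interface Uploader {
        void upload(Path part) throws IOException;
    }

    /** Parts still waiting in the directory, in name order. */
    static List<Path> pending(Path partsDir) {
        try (Stream<Path> files = Files.list(partsDir)) {
            return files.filter(Files::isRegularFile).sorted().collect(Collectors.toList());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /**
     * Uploads pending parts one by one and returns how many succeeded.
     * Stops at the first failure, leaving that part and all later ones on disk
     * so the next run picks up where this one stopped.
     */
    static int drain(Path partsDir, Uploader uploader) {
        int uploaded = 0;
        for (Path part : pending(partsDir)) {
            try {
                uploader.upload(part);
                Files.delete(part);  // remove the local copy only after a successful upload
                uploaded++;
            } catch (IOException e) {
                break;
            }
        }
        return uploaded;
    }
}
```

Because no separate state file is kept, a crash or reboot cannot leave the queue out of sync with what is on disk. The trade-off is that a part whose local delete fails will be uploaded again on the next run.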

Full archive split into 97 parts

After nearly a day all 97 parts are uploaded to S3 without issue.

Full archive uploaded to S3 in 97 parts

Half a day later the storage class changes to Glacier according to the lifecycle rules. It all works as intended.

Results: I was able to select a competitive and reliable cloud backup vendor (AWS), vet both Glacier and S3+Glacier workflows, and get experience with the S3 Java SDK v2 to implement automatic daily and weekly big data backups using an incremental backup approach utilizing background workers. I expect to be billed for 660GB+ of backups a month due to the minimum Glacier-class billing period of 90 days – this is about US$2.64/month for my peace of mind.

Update: Here is a graph over time of the size of my S3 archives bucket showing the progression of full archives and incremental archives. The latter two slopes, which indicate the upload of full archives, are steeper than the leftmost slope because I upgraded my internet package from 5Mbps up to 7.5Mbps up.

S3 Glacier archive storage progression

Notes:

  1. They say unlimited data for the same price