Big Data Backup to S3 Glacier via Java SDK

Goal: Use the AWS S3 and Glacier Java SDKs (v2) to implement automatic big data backups – 200GB, growing by 72GB a year – to Glacier cold cloud storage, saving money on data archival and adding peace of mind.

Why Glacier? It’s exceedingly inexpensive to archive data for disaster recovery in Glacier. Glacier storage is only US$0.004 per GB/mo, and their SDK is beautiful. Ordinary S3 alone is relatively expensive for the use case, Google Drive and Dropbox are overpriced for big-data archival, and the others are unknowns.

Let’s compare apples to apples on cost first.

Expected Cost

| Provider | 100GB | 200GB | 500GB | 1TB | 2TB | 10TB |
|---|---|---|---|---|---|---|
| AWS Glacier (monthly, at US$0.004/GB) | US$0.40 | US$0.80 | US$2.00 | US$4.00 | US$8.00 | US$40 |
| AWS S3 (monthly) | US$2.30 | US$4.60 | US$11.50 | US$23 | US$46 | US$230 |
| AWS S3 (yearly) | US$27.60 | US$55.20 | US$138 | US$276 | US$552 | US$2,760 |

The pricing sweet spot for big data storage is between 200GB and 2TB with Glacier, plus Amazon has an easy-to-use Java SDK and Maven POM already available. I added Degoo and CrashPlan because of their interesting pricing models.

AWS Glacier is the clear winner with the goal of big data backup at under a US dollar per month for 200GB.

A First Look at the AWS Glacier Java SDKs

AWS Glacier Java SDK examples

Before settling on the AWS ecosystem, I experimented with the ListVaults example. It was quick to set up a custom AWS policy for Glacier uploads, assign the policy to a new user scoped to Glacier only, allow programmatic access (to get the secret keys for API interaction), and create a vault.

Set up Maven Dependencies

Setting up the Maven SDK BOM (bill of materials) dependency manager and the Glacier module was quick. However, the overwhelming majority of sample code online targets version 1 of the SDK, so I chose to include both the v1 and v2 SDKs side by side. Here are the API differences between the two versions.
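As a sketch, the POM fragment might look like this (version numbers are illustrative; check Maven Central for current releases):

```xml
<dependencyManagement>
  <dependencies>
    <!-- SDK v2 bill of materials pins compatible module versions -->
    <dependency>
      <groupId>software.amazon.awssdk</groupId>
      <artifactId>bom</artifactId>
      <version>2.17.100</version>
      <type>pom</type>
      <scope>import</scope>
    </dependency>
  </dependencies>
</dependencyManagement>
<dependencies>
  <!-- SDK v2 Glacier client (version managed by the BOM) -->
  <dependency>
    <groupId>software.amazon.awssdk</groupId>
    <artifactId>glacier</artifactId>
  </dependency>
  <!-- SDK v1 kept side by side for the high-level upload samples -->
  <dependency>
    <groupId>com.amazonaws</groupId>
    <artifactId>aws-java-sdk-glacier</artifactId>
    <version>1.11.1000</version>
  </dependency>
</dependencies>
```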

List Glacier Vaults

My Java SDK v2 test code to get the vault descriptors is below.
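A minimal sketch of that listing, assuming the default credentials provider chain and an illustrative region:

```java
import software.amazon.awssdk.regions.Region;
import software.amazon.awssdk.services.glacier.GlacierClient;
import software.amazon.awssdk.services.glacier.model.DescribeVaultOutput;
import software.amazon.awssdk.services.glacier.model.ListVaultsRequest;
import software.amazon.awssdk.services.glacier.model.ListVaultsResponse;

public class ListVaults {
    public static void main(String[] args) {
        // Credentials come from the default provider chain (env vars, ~/.aws/credentials, etc.)
        try (GlacierClient glacier = GlacierClient.builder().region(Region.US_WEST_2).build()) {
            ListVaultsResponse response = glacier.listVaults(ListVaultsRequest.builder().build());
            for (DescribeVaultOutput vault : response.vaultList()) {
                System.out.printf("%s: %d archives, %d bytes%n",
                        vault.vaultName(), vault.numberOfArchives(), vault.sizeInBytes());
            }
        }
    }
}
```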

Create a Glacier Vault

Next I try to create a vault with the SDKv2.
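A sketch of the vault-creation call (the vault name is a placeholder):

```java
import software.amazon.awssdk.regions.Region;
import software.amazon.awssdk.services.glacier.GlacierClient;
import software.amazon.awssdk.services.glacier.model.CreateVaultRequest;
import software.amazon.awssdk.services.glacier.model.CreateVaultResponse;

public class CreateVault {
    public static void main(String[] args) {
        try (GlacierClient glacier = GlacierClient.builder().region(Region.US_WEST_2).build()) {
            CreateVaultRequest request = CreateVaultRequest.builder()
                    .vaultName("backup-vault")   // illustrative vault name
                    .build();
            CreateVaultResponse response = glacier.createVault(request);
            System.out.println("Vault location: " + response.location());
        }
    }
}
```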

This short snippet creates a new vault perfectly. In fact, the call is idempotent: repeated calls to create the same vault raise no exceptions.

Upload a Test Archive to Glacier

Next, I’d like to use the high-level upload manager library (v1 only) to test uploading to Glacier. When this is available for v2 I may come back and refactor the code.
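A sketch of that v1 high-level upload, with an illustrative vault name and test file:

```java
import java.io.File;

import com.amazonaws.services.glacier.AmazonGlacierClientBuilder;
import com.amazonaws.services.glacier.transfer.ArchiveTransferManager;
import com.amazonaws.services.glacier.transfer.ArchiveTransferManagerBuilder;
import com.amazonaws.services.glacier.transfer.UploadResult;

public class UploadArchive {
    public static void main(String[] args) throws Exception {
        // SDK v1 high-level helper: handles multipart chunking and retries for us
        ArchiveTransferManager atm = new ArchiveTransferManagerBuilder()
                .withGlacierClient(AmazonGlacierClientBuilder.defaultClient())
                .build();
        UploadResult result = atm.upload("backup-vault", "test-44MB.zip",
                new File("test-44MB.zip"));
        // Keep this id: Glacier archives are addressed by id, not by file name
        System.out.println("Archive ID: " + result.getArchiveId());
    }
}
```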

Uploading worked for a 44MB test file, albeit slowly, taking about 1.2 minutes with an upload speed of about 5Mbps – that’s 0.6MB/s.

Slow upload speed to Glacier
Is something slow with Glacier? I modified the example code several times, but the upload speed was always about 5Mbps. That led me to run a speed test, which also showed 5Mbps. Having not audited my internet plan for a while, I hadn't realized my connection has a very slow upload speed.

List Glacier Vault Contents

Here is the first wrinkle I encountered: it takes from a few hours to half a day to inventory each vault, so I cannot immediately verify that the archives were uploaded, nor check their checksums. After creating an inventory job and waiting hours for it to complete, I could retrieve a stream that resolved into a JSON response. Here is a sample response:
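The inventory output follows the documented Glacier format; the account id, archive id, hash, and dates below are invented placeholders for illustration:

```json
{
  "VaultARN": "arn:aws:glacier:us-west-2:111122223333:vaults/backup-vault",
  "InventoryDate": "2020-03-08T03:03:41Z",
  "ArchiveList": [
    {
      "ArchiveId": "AAabcd1234...",
      "ArchiveDescription": "test-44MB.zip",
      "CreationDate": "2020-03-07T11:22:33Z",
      "Size": 46137344,
      "SHA256TreeHash": "9628195fcdbcbbe76cdde456d4646fa7de5f219fb39823f3dabc443988c5c360"
    }
  ]
}
```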

The first thing I noticed is that even though I uploaded the same file multiple times, each upload is treated as a separate object with a description, a timestamp, and a unique ID. There is no file name associated with the object, so I had better include the original filename in the description. Also, files are never overwritten. This is definitely different from uploading to S3 or casual-user cloud storage.

Note: Anything uploaded to Glacier is billed for a minimum of 90 days even if it is deleted before then.

AWS S3 Glacier Storage Class and Lifecycle Policies

While researching how others solved this problem, I discovered that some people upload their archives to the more expensive AWS S3 but store them under the Glacier storage class. So-called lifecycle policies can be set up to do this automatically. The uploads are Glacier-priced but accessed via the S3 API, and bucket (not vault) inventory is instantaneous. However, the S3 API has no notion of vaults and vault locks, and retrieval times are just as long.

Remember: Amazon Glacier != S3 Glacier storage class
Amazon S3 storage classes (source)

Here is an example of setting the bucket lifecycle policy to transition S3 objects immediately to the Glacier storage class (set days to 0). Interestingly, through S3 only (not the Glacier API), there is a Glacier Deep Archive storage class with a minimum billing period of 6 months that is even cheaper than Glacier at US$0.00099 per GB/mo. Since Glacier requires a minimum billing period of 90 days, the objects can transition from the Glacier class to Deep Archive after 90 days without incurring any minimum-duration charge.
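A sketch of that two-step lifecycle rule with the SDK v2 (the bucket name is illustrative):

```java
import software.amazon.awssdk.services.s3.S3Client;
import software.amazon.awssdk.services.s3.model.BucketLifecycleConfiguration;
import software.amazon.awssdk.services.s3.model.ExpirationStatus;
import software.amazon.awssdk.services.s3.model.LifecycleRule;
import software.amazon.awssdk.services.s3.model.LifecycleRuleFilter;
import software.amazon.awssdk.services.s3.model.PutBucketLifecycleConfigurationRequest;
import software.amazon.awssdk.services.s3.model.Transition;
import software.amazon.awssdk.services.s3.model.TransitionStorageClass;

public class SetLifecycle {
    public static void main(String[] args) {
        try (S3Client s3 = S3Client.create()) {
            LifecycleRule rule = LifecycleRule.builder()
                    .id("archive-immediately")
                    .status(ExpirationStatus.ENABLED)
                    .filter(LifecycleRuleFilter.builder().prefix("").build()) // whole bucket
                    .transitions(
                            // days = 0: objects move to the Glacier class right away
                            Transition.builder().days(0)
                                    .storageClass(TransitionStorageClass.GLACIER).build(),
                            // after the 90-day Glacier minimum, drop to Deep Archive
                            Transition.builder().days(90)
                                    .storageClass(TransitionStorageClass.DEEP_ARCHIVE).build())
                    .build();
            s3.putBucketLifecycleConfiguration(PutBucketLifecycleConfigurationRequest.builder()
                    .bucket("my-backup-bucket")
                    .lifecycleConfiguration(
                            BucketLifecycleConfiguration.builder().rules(rule).build())
                    .build());
        }
    }
}
```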

S3 lifecycle policy to transition to Glacier

Having experienced the limitations of a pure-Glacier implementation, S3 with the Glacier and Deep Archive storage classes looks to be a better alternative, which I will explore next.

Update the Maven Dependencies

I’ll remove SDK v1 dependencies to get used to SDK v2.

Upload a Test Archive to S3

After manually setting up a bucket with permissions, encryption, and more whiz-bang settings, my first test in code was to upload a file to S3. There are no folders in S3 – it is essentially a giant key:value store – so I abstracted the concept of folders to stay consistent with my workflow. The simplest upload code follows.
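A sketch of the simplest upload, with illustrative bucket, key, and file names:

```java
import java.nio.file.Paths;

import software.amazon.awssdk.core.sync.RequestBody;
import software.amazon.awssdk.services.s3.S3Client;
import software.amazon.awssdk.services.s3.model.PutObjectRequest;

public class UploadToS3 {
    public static void main(String[] args) {
        try (S3Client s3 = S3Client.create()) {
            // "folders" in the key are just a naming convention - S3 keys are flat
            s3.putObject(PutObjectRequest.builder()
                            .bucket("my-backup-bucket")
                            .key("backups/2020/test-44MB.zip")
                            .build(),
                    RequestBody.fromFile(Paths.get("test-44MB.zip")));
        }
    }
}
```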

Sure enough my file appears in the AWS console. This is much more convenient than uploading directly to Glacier.

AWS S3 dashboard after API upload

However, the object (file) didn't immediately transition to Glacier storage. I tried again without specifying the storage-class parameter – again, nothing. Finally, I specified StorageClass.GLACIER explicitly, which did work. I still need to explore why the lifecycle policies didn't kick in.
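The working variant simply adds the storage class to the request (names are again illustrative):

```java
import java.nio.file.Paths;

import software.amazon.awssdk.core.sync.RequestBody;
import software.amazon.awssdk.services.s3.S3Client;
import software.amazon.awssdk.services.s3.model.PutObjectRequest;
import software.amazon.awssdk.services.s3.model.StorageClass;

public class UploadToGlacierClass {
    public static void main(String[] args) {
        try (S3Client s3 = S3Client.create()) {
            s3.putObject(PutObjectRequest.builder()
                            .bucket("my-backup-bucket")
                            .key("backups/2020/test-44MB.zip")
                            // land directly in the Glacier storage class
                            .storageClass(StorageClass.GLACIER)
                            .build(),
                    RequestBody.fromFile(Paths.get("test-44MB.zip")));
        }
    }
}
```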

Manually saving to Glacier storage class in S3

I’m working on this now. Results and what I’ve learned will follow…

Strategies for Continually Archiving Big Data

I’m at a crossroads. With a 5Mbps upload speed it will take 22.8 hours to periodically upload a 50GB compressed archive of my financial time-series data to Glacier. I could make this a scheduled weekly job on Sunday (all-day Sunday), or I could explore faster internet providers. Unfortunately, the telcos in British Columbia, Canada all have similarly slow upload speeds, possibly in an effort to nudge power users toward business plans. Fiber-optic coverage is spotty in Metro Vancouver; otherwise that would be perfect, as fiber's symmetric speeds are ideal. Let's assume I'm limited to 5Mbps for now (recall that on-prem hardware is far less expensive than cloud processing).

Backup Atomically

One option is to run a start-to-finish Glacier upload job every Sunday taking 23 hours to upload a 50GB-and-growing compressed archive. With AWS Glacier, the backup can be uploaded in parts. If the upload is interrupted, it can be resumed as long as it is within 24 hours. This provides 24 hours for a 23-hour job at 5Mbps, weekly.
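Some quick arithmetic on that window (a sketch; the 128 MiB part size and 5 Mbps uplink are my assumptions):

```java
// Sizing a weekly multipart Glacier upload: part count and total upload time.
public class UploadBudget {
    public static void main(String[] args) {
        long archiveBytes = 50L * 1024 * 1024 * 1024;  // 50 GiB weekly archive
        long partBytes = 128L * 1024 * 1024;           // Glacier parts: a power-of-two number of MiB
        long parts = (archiveBytes + partBytes - 1) / partBytes;  // round up
        double seconds = archiveBytes * 8 / 5_000_000.0;          // 5 Mbps uplink
        double hours = seconds / 3600.0;
        System.out.printf("parts=%d hours=%.1f%n", parts, hours);
    }
}
```

At 5 Mbps the job runs close to a full day, which shows how tight the 24-hour window is.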

Pros: Coding is straightforward. Cons: The operation takes 23 hours now and will lengthen as the archive grows. A failure anywhere results in the loss of the whole upload, and taking longer than 24 hours results in a total loss as well.

Backup Incrementally

Before any ETL, all the big data is stored in 55,000+ SQLite files totaling nearly 200GB. Each one can be compressed individually and uploaded to Glacier. Over the course of the week, the previous week's archives can be uploaded steadily. This provides 168 hours for a 23-hour job at 5Mbps, weekly.
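Per-file compression needs nothing beyond the JDK. A minimal sketch (the file names and paths are illustrative):

```java
// Compress one SQLite file to .gz before upload, streaming so large files never sit in memory.
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.zip.GZIPOutputStream;

public class CompressOne {
    static Path gzip(Path src) throws IOException {
        Path dst = Paths.get(src.toString() + ".gz");
        try (InputStream in = Files.newInputStream(src);
             OutputStream out = new GZIPOutputStream(Files.newOutputStream(dst))) {
            in.transferTo(out);  // stream copy, constant memory
        }
        return dst;
    }

    public static void main(String[] args) throws IOException {
        // Demo on a throwaway temp file standing in for one of the 55,000+ SQLite files
        Path src = Files.createTempFile("ticks", ".sqlite");
        Files.writeString(src, "not a real database, just test bytes");
        Path gz = gzip(src);
        System.out.println(Files.exists(gz));
    }
}
```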

It’s minute, but I discovered that Amazon reserves 32KB of metadata per archive in Glacier, versus 8KB per object in S3, both of which are charged back to the user. For my use case, that is about 55,000 files at 32KB, or 1.68GB of overhead (US$0.006/month).

Pros: An entire week is available for uploading. Cons: A disaster during the week leaves the archive incomplete. More upload management and tracking is required.

Backup Deltas

Every day, a rolling shadow copy of the most recent week of data is updated. Uncompressed this is about 5GB (2.3 hours to upload at 5Mbps); compressed it is about 1GB (28 minutes at 5Mbps). This strategy provides 168 hours for a 28-minute job, weekly.
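Checking that arithmetic (a sketch; sizes in GiB and the 5 Mbps uplink are my assumptions):

```java
// Upload-time arithmetic for the delta strategy at 5 Mbps.
public class DeltaBudget {
    static double hoursAt5Mbps(double gib) {
        double bits = gib * 1024 * 1024 * 1024 * 8;  // GiB -> bits
        return bits / 5_000_000.0 / 3600.0;          // 5 Mbps uplink, seconds -> hours
    }

    public static void main(String[] args) {
        System.out.printf("uncompressed 5 GiB: %.1f h%n", hoursAt5Mbps(5));
        System.out.printf("compressed 1 GiB: %.1f min%n", hoursAt5Mbps(1) * 60);
    }
}
```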

Pros: Weekly uploads are extremely quick. Coding is straightforward. Cons: Disaster recovery requires a complete initial archive plus the chain of deltas to assemble the complete data again.

Backup Strategy in Practice

I’ve decided to do what I do with my hard drive backups: create a full backup, create a chain of deltas, create another full backup, and repeat.

I’m working on this now. Results and what I’ve learned will follow…


  1. They say unlimited data for the same price