@duncan

Working with Azure blob md5 digests

June 21, 2018

Azure Blob Storage provides MD5 digests of files in base64 encoding. This causes a small issue because many command line tools — as well as AWS S3 — use hex encoding for their digests. You’ll need to do some conversions if you want to compare files using the two different digests.

What are MD5 digests and why would we care about comparing them? In a nutshell, they provide a short constant-sized checksum for a file. No matter how big a file is, you can get a summary that represents the state of the file using a standard function. Take a digest of a 4MB file (or a 4TB one!) on your local computer and another generated on a server for the file you want to compare it to, and you only need to transfer 16 bytes (32 characters in hex) to tell whether the contents are the same.

Well, that’s not strictly true. Two different files can produce the same MD5 digest, and such collisions can be constructed deliberately, which makes the algorithm unacceptable for cryptographically secure purposes. When you know that you’re comparing what should be the same file on both sides of a network connection, however, it’s perfect for the job. Comparing digests instead of transferring entire files makes a directory sync process much faster.
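To make the constant-size point concrete, here is a minimal Ruby sketch. The two inputs are arbitrary strings chosen for illustration, not files from my site:

```ruby
require 'digest'

# Two inputs of very different sizes (contents are arbitrary).
small = "hello"
large = "x" * (4 * 1024 * 1024) # 4 MB of the same byte

[small, large].each do |data|
  raw = Digest::MD5.digest(data)       # raw binary digest
  hex = Digest::MD5.hexdigest(data)    # the same digest, hex-encoded
  b64 = Digest::MD5.base64digest(data) # the same digest, base64-encoded
  puts "#{data.bytesize} bytes in -> #{raw.bytesize} raw bytes, " \
       "#{hex.length} hex chars, #{b64.length} base64 chars"
end
```

Both inputs report 16 raw bytes, 32 hex characters, and 24 base64 characters. The hex and base64 strings are two spellings of the same 16 bytes.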

Ok. Enough about that. Let’s look at how to compare digests in the two different encodings, which is the whole purpose of this post.

Using shell

Running the command line md5 tool on macOS (md5sum is the Linux equivalent) against my website’s 404 page on my local filesystem gives:

$ md5 -q public/404.html 
"3b75d4041511720a85f535897008d14b"

If I look at that same file’s checksum as returned by Azure (I’m using the az and jq command line tools to fetch the properties of the blob and select out the right field from the returned JSON), I get:

$ az storage blob show -c \$web -n 404.html | 
    jq -r .properties.contentSettings.contentMd5
"O3XUBBURcgqF9TWJcAjRSw=="

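There’s the difference between the hex and base64 encodings. Comparing those two digests as strings won’t give us the results we want. Enter xxd, the command line hexdump tool, and base64, the command line base64 encoder. Running xxd -r -p turns the hex text back into raw bytes, and base64 then encodes those bytes: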

$ md5 -q public/404.html | xxd -r -p | base64
"O3XUBBURcgqF9TWJcAjRSw=="

Perfect. Now we can compare digests. And, as you’d expect, we can run everything in reverse to go the other way (GNU base64 on Linux spells the decode flag -d rather than -D):

$ echo "O3XUBBURcgqF9TWJcAjRSw==" | base64 -D | xxd -p
"3b75d4041511720a85f535897008d14b"

Put all of the above together, and we can get the hex-encoded hash directly from Azure with:

$ az storage blob show -c \$web -n 404.html |  
    jq -r .properties.contentSettings.contentMd5 | 
    base64 -D | 
    xxd -p
"3b75d4041511720a85f535897008d14b"

Now, we’re cooking. If you’re a hardcore shell user, like my friend Nathan Herald (who taught me all about the magical command-line JSON processor that is jq), you can wrap these pipelines up in some bash functions and you’re set.

Using ruby

While I like proving things out using shell, the reason I went down this rabbit hole in the first place was to sync my website content with Azure Blob Storage. I use Rake to build my site, so let’s look at this in Ruby.

Grabbing the hex-encoded digest of a file in Ruby is straightforward:

require 'digest'

path = "public/404.html"
hexdigest = Digest::MD5.hexdigest(File.read(path))
# hexdigest is "3b75d4041511720a85f535897008d14b"

Getting the base64-encoded version is just as easy:

base64digest = Digest::MD5.base64digest(File.read(path))
# base64digest is "O3XUBBURcgqF9TWJcAjRSw=="
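One caveat for very large files: File.read loads the whole file into memory before hashing. A possible alternative, sketched here, is Digest::MD5.file, which reads the file incrementally. The helper name md5_digests and the temp file are my own illustration and not part of the site build:

```ruby
require 'digest'
require 'tempfile'

# Hypothetical helper: returns both encodings for the file at `path`,
# streaming its contents rather than loading them all at once.
def md5_digests(path)
  digest = Digest::MD5.file(path)
  { hex: digest.hexdigest, base64: digest.base64digest }
end

# Demonstrate on a throwaway temp file so the example runs anywhere.
Tempfile.create("md5-demo") do |f|
  f.write("hello")
  f.flush
  p md5_digests(f.path)
end
```

Computing both encodings from one digest object also means the file is only read once.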

If you need to go between the two, however, because you’re comparing a digest you have from AWS S3 with one from Azure Blob Storage, it’s a bit more work. This requires getting comfortable with Ruby’s Array#pack and String#unpack methods. To go from hex to base64 encoding, you can do:

base64digest = [[hexdigest].pack("H*")].pack("m0")
# base64digest is "O3XUBBURcgqF9TWJcAjRSw=="

And, to do the reverse from base64 to hex:

base64digest.unpack("m")[0].unpack("H*")[0]
# returns "3b75d4041511720a85f535897008d14b"

These incantations aren’t exactly pretty and don’t really expose their intent well, but once figured out they work nicely and can be hidden away behind a function somewhere.
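As a sketch of what hiding them behind a function might look like, here are two small wrappers. The names hex_to_base64 and base64_to_hex are hypothetical; the bodies are the pack and unpack calls from above:

```ruby
# "H*" packs a hex string into raw bytes (high nibble first);
# "m0" base64-encodes those bytes without adding line breaks.
def hex_to_base64(hexdigest)
  [[hexdigest].pack("H*")].pack("m0")
end

# "m" decodes base64 back into raw bytes;
# "H*" unpacks those bytes as a hex string.
def base64_to_hex(base64digest)
  base64digest.unpack("m")[0].unpack("H*")[0]
end

puts hex_to_base64("3b75d4041511720a85f535897008d14b") # O3XUBBURcgqF9TWJcAjRSw==
puts base64_to_hex("O3XUBBURcgqF9TWJcAjRSw==")         # 3b75d4041511720a85f535897008d14b
```

With names like these at the call site, the intent is readable even if the format strings are not.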

Putting things together

I mentioned that I went down this rabbit hole to upload my website to Azure Blob Storage. Here was the next step in proving out how to do that in Ruby, and an example of why using MD5 digests is useful:

require 'digest'
require 'azure/storage/blob'

path = "public/404.html"
data = File.read(path)
# Azure reports base64, so compute the local digest in the same encoding.
local_digest = Digest::MD5.base64digest(data)

client = Azure::Storage::Blob::BlobService.create
blob = client.get_blob_properties("$web", "404.html")
remote_digest = blob.properties[:content_md5]

# Only upload when the contents differ.
if local_digest != remote_digest
  client.create_block_blob("$web", "404.html", data)
end

After this, the next step was to get all the hashes for both the remote and local files, compare them, and upload the ones that changed. However, that’s an exercise I’ll leave to the intrepid reader.
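For anyone who wants a starting point, the comparison step can be sketched without touching the network. Both helper names below are hypothetical, and the remote map is assumed to be a Hash of blob name to base64 MD5 that has already been fetched some other way. The demo uses a temp directory and a made-up remote map rather than real blobs:

```ruby
require 'digest'
require 'tmpdir'

# Hypothetical helper: map each file under `dir` (by relative path)
# to its base64-encoded MD5 digest.
def local_digests(dir)
  Dir.glob("**/*", base: dir).each_with_object({}) do |name, digests|
    path = File.join(dir, name)
    next unless File.file?(path) # skip directories
    digests[name] = Digest::MD5.file(path).base64digest
  end
end

# Hypothetical helper: names whose local digest is missing from,
# or different from, the remote map.
def changed_files(local, remote)
  local.select { |name, digest| remote[name] != digest }.keys
end

# Demonstration with a temp directory and a made-up remote map.
Dir.mktmpdir do |dir|
  File.write(File.join(dir, "404.html"), "not found")
  File.write(File.join(dir, "index.html"), "hello")

  local = local_digests(dir)
  remote = { "index.html" => local["index.html"], "404.html" => "stale-digest" }
  p changed_files(local, remote) # ["404.html"]
end
```

Only the names returned by changed_files would need uploading. Deleting remote blobs that no longer exist locally is a separate step this sketch leaves out.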