Wednesday, 8 February 2017

Understanding Base64 Encoding #5

Tier 5

This tier is aimed at filling in a few gaps, showing the wider applicability of base64 encoding, and pointing to further reading.

Padding: The Trailing Equals Character

When I first looked at the characters used in base64 encoding I noticed there was a cheeky 65th character (‘=’) sometimes appearing once or twice at the end of encoded data. It’s actually a special character used when the source binary data doesn’t divide neatly into three-byte blocks. A quick example to illustrate.

Imagine I want to base64 encode the following four 8-bit bytes:
01000001 01100100 01100001 01101101

I take the first three octets:
01000001 01100100 01100001

Represent them as four sextets:
010000 010110 010001 100001

And encode using my encoding key, producing: QWRh

But now I have a lonely, final octet left to encode: 01101101

In base64 encoding it’s simply padded out with trailing zeros until we have another three octets:
01101101 00000000 00000000

And converted to sextets as normal:
011011 010000 000000 000000

Any sextet made up entirely of padding zeros gets represented as ‘=’. The second sextet (010000) still carries two bits of real data, so it is encoded as normal.

So the rest of the encoded data becomes: bQ==.
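To double-check the worked example, here is a minimal Python sketch of the steps above. The names encode_by_hand and ALPHABET are made up for illustration, and I'm assuming the encoding key is the standard base64 alphabet. Python's built-in base64 module is only used to cross-check the result.

```python
import base64

# Standard base64 alphabet (assumed to match the encoding key used in this series).
ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/"

def encode_by_hand(data: bytes) -> str:
    """Illustrative encoder that follows the steps in the text, one block at a time."""
    out = []
    for i in range(0, len(data), 3):
        block = data[i:i + 3]
        missing = 3 - len(block)               # 0, 1 or 2 octets short of a full block
        padded = block + b"\x00" * missing     # pad out with trailing zeros
        bits = int.from_bytes(padded, "big")   # the block as one 24-bit number
        sextets = [(bits >> shift) & 0b111111 for shift in (18, 12, 6, 0)]
        chars = [ALPHABET[s] for s in sextets]
        # The last `missing` sextets contain nothing but padding, so they become '='.
        chars[4 - missing:] = ["="] * missing
        out.extend(chars)
    return "".join(out)

# The four octets from the example above.
data = bytes([0b01000001, 0b01100100, 0b01100001, 0b01101101])

print(encode_by_hand(data))                     # QWRhbQ==
print(base64.b64encode(data).decode("ascii"))   # QWRhbQ==
```

Both lines print the same thing, which is the QWRh from the first block followed by the bQ== from the padded one.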

The ‘=’ character is a bit of a courtesy and not every implementation of base64 encoding uses it. It is possible to recreate the original binary data without ‘=’ padding, because the number of encoded characters already tells the decoder how many bytes were in the final block; it’s just more explicit to include it.
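As a small illustration of why the padding is recoverable, here is a hedged Python sketch. Python's standard base64.b64decode expects the padding to be present, so the hypothetical helper decode_unpadded below (my name, not a library function) works out how many ‘=’ characters were dropped from the length of the text and puts them back before decoding.

```python
import base64

def decode_unpadded(text: str) -> bytes:
    """Decode base64 text whose trailing '=' padding may have been stripped."""
    # Encoded data always comes in groups of four characters, so the
    # shortfall from a multiple of four is the amount of missing padding.
    padding_needed = -len(text) % 4
    return base64.b64decode(text + "=" * padding_needed)

print(decode_unpadded("QWRhbQ"))    # b'Adam'
print(decode_unpadded("QWRhbQ=="))  # b'Adam'
```

A length that is one more than a multiple of four can't be produced by a valid encoding, so this sketch makes no attempt to handle that case gracefully.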

Other Uses:

Base64 encoding is typically used in scenarios where representing binary data as a limited set of ASCII characters is desirable. This could be when using an 8-bit (or greater) character encoding isn’t viable, or when you wish to embed binary data in an explicitly text-based medium, or when sending non-alphanumeric characters could be an issue.

Attachments to emails are base64 encoded, as are the username and password sent for basic HTTP authentication. The specifics of why base64 encoding is used in these scenarios are beyond this series, but reading about https://en.wikipedia.org/wiki/8-bit_clean and https://en.wikipedia.org/wiki/Email_attachment gives you a good idea of why this is the case. The quote below, taken from the Email Attachment Wikipedia page, gives a good sense of the history:

“Originally Internet SMTP email was 7-bit ASCII text only, and attaching files was done by manually encoding 8-bit files using uuencode, BinHex or xxencode and pasting the resulting text into the body of the message.”
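For the basic HTTP authentication case mentioned above, here is a rough Python sketch of how the header value is put together: the username and password are joined with a colon, base64 encoded, and prefixed with the word Basic. The function name basic_auth_header and the credentials are placeholders for illustration only.

```python
import base64

def basic_auth_header(username: str, password: str) -> str:
    """Build the value of an HTTP 'Authorization' header for basic authentication."""
    # Join as "username:password", then base64 encode the resulting bytes.
    credentials = f"{username}:{password}".encode("utf-8")
    token = base64.b64encode(credentials).decode("ascii")
    return f"Basic {token}"

# Placeholder credentials, not real ones.
print(basic_auth_header("Aladdin", "open sesame"))
# Basic QWxhZGRpbjpvcGVuIHNlc2FtZQ==
```

Note that this is encoding, not encryption: anyone who sees the header can decode it straight back to the username and password.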

Further Resources:

Once you’ve grasped the basics of base64 encoding the Wikipedia article actually becomes useful. To my mind it’s missing a Tier 1 style explanation but it’s otherwise quite passable.

There’s an Oracle blog post which is also good – again, if you’ve got some base knowledge to work from.

And when you want to go full nerd there’s the IETF spec!
