Friday 27 November 2015

The Why of the Kilobyte (and data sizes generally)

I am a Computer Science graduate and a developer of ten years. Embarrassingly, it took me until last night to jump down the rabbit hole of the terminology used when describing quantities of data. As usual, I didn't find exactly what I was looking for on the internet, so here are my thoughts:

I like to think I understand binary, in a rudimentary fashion at least. I can explain that it's a base-2 number system, having two symbols to represent its numbers: "0" and "1". I can show you how to count in a base-2 system and show you why it works that way. I can contrast it with a base-4, base-10 or base-16 system and show how those work. I can perform basic binary addition. Essentially, I'm trying to establish my credentials as someone who isn't a complete binary dullard.
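To make the counting concrete, here is a minimal sketch of writing a number in any base from 2 to 16 by repeated division. The helper name `to_base` and the `DIGITS` table are my own, invented for illustration; Python's built-in `bin()` and `hex()` cover the common cases.

```python
DIGITS = "0123456789ABCDEF"  # symbols available for bases up to 16


def to_base(n, base):
    """Render a non-negative integer in the given base (2 to 16)."""
    if n == 0:
        return "0"
    out = []
    while n > 0:
        # The remainder is the next digit, working from right to left.
        n, remainder = divmod(n, base)
        out.append(DIGITS[remainder])
    return "".join(reversed(out))


# Counting from zero to five in base-2.
for n in range(6):
    print(n, "=", to_base(n, 2))

# The same quantity written in four different bases.
for base in (2, 4, 10, 16):
    print("base", base, ":", to_base(27, base))

# Basic binary addition: 1011 + 0110 (that is, 11 + 6).
print(to_base(0b1011 + 0b0110, 2))
```

The only thing that changes between bases is how many symbols you get through before you have to carry into the next column.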

I also understand that one bit (Binary digIT) isn't an awful lot of use on its own. It can be on/off, high/low, true/false - however you choose to describe it - but only in context and combination with other bits does it become interesting and useful. And this is where my journey down the rabbit hole began...

Let me start with good old, recognisable base-10. It has ten symbols to use when representing numbers: 0, 1, 2, 3, 4, 5, 6, 7, 8, 9. We - people - have chosen to assign special names to particular, neat representations of quantity in this system:

1 = one = 10^0
10 = ten = 10^1
100 = hundred = 10^2
1000 = thousand = 10^3
1000000 = million = 10^6
1000000000 = billion = 10^9 (old British billion: 1,000,000,000,000 = 10^12)

I haven't worked out quite why we decided those particular representations were worthy of their own name; it feels like an addition chain, especially if you go with the old British billion: 0, 1, 2, 3, 6, 9|12.

With this in mind we approach base-2, where the ground appears to completely shift. We start by giving names to collections of bits, seemingly more interested in the range of numbers a collection of bits can represent than the numbers themselves. So...

0 = bit = 0 to 1
0000 = nibble | nyble = 0 to 15
0000 0000 = byte = 0 to 255
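The ranges above all come from one rule: n bits give 2^n distinct patterns, so the unsigned values run from 0 to 2^n - 1. A quick sketch, with `value_range` being a name I have made up for the purpose:

```python
def value_range(bits):
    """Return the (lowest, highest) unsigned value that fits in `bits` bits."""
    return 0, 2 ** bits - 1


for name, bits in [("bit", 1), ("nibble", 4), ("byte", 8)]:
    low, high = value_range(bits)
    print(f"{name}: {bits} bit(s), {low} to {high}")
```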

You think "okay, well it's a different world, things are different here... maybe a different pattern is used". Once you get your head round the fact that it's collections of bits (ranges of numbers) that are given names, rather than the numbers themselves, then maybe you can work out the pattern. Maybe 16 bits or 32 bits have a special name? Nope. It's all madness from here on in!

0000 0000 0000 0000 = 2 bytes | 16 bits
0000 0000 0000 0000 0000 0000 0000 0000 = 4 bytes | 32 bits

What appears to have happened is that someone decided bits are no longer interesting and that... wait for it... quantities of bytes are interesting (completely eschewing the lowly bit), and then decided either 1000 or 1024 (depending on your stance) is an interesting quantity of these byte things to be concerned about. I can only imagine being interested in ~1000 of these things is the spectre of base-10 hovering over the decision making.

1024 x byte = kilobyte
1024 x kilobyte = megabyte
1024 x megabyte = gigabyte

If someone can explain the why behind this thinking I'll be greatly appreciative. I can only imagine that "kilo" and "mega" are impositions from the world of base-10 and that multiples of bytes are interesting because 8 bits can represent a character (as per ASCII) or some machine instruction.
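For what it's worth, the 1024 figure at least has a tidy base-2 reading: it is 2^10, the power of two that lands closest to one thousand. Here is a rough sketch of how far the two stances drift apart as the prefixes climb; the function name `sizes` is mine, purely for illustration.

```python
def sizes(power):
    """Return (decimal, binary) byte counts for the given prefix step."""
    return 1000 ** power, 1024 ** power


print(2 ** 10)  # 1024

for power, name in enumerate(["kilobyte", "megabyte", "gigabyte"], start=1):
    decimal, binary = sizes(power)
    # How much bigger the 1024-based reading is, as a percentage.
    gap = (binary - decimal) / decimal * 100
    print(f"{name}: {decimal} vs {binary} bytes ({gap:.1f}% apart)")
```

The gap is small at the kilobyte level and grows with every step up, which is presumably why which stance you take starts to matter.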

Thursday 5 November 2015


I agonise over naming things when it comes to coding; names convey intention and purpose and are one of the first things you rub up against when trying to figure out a new concept or someone else's code - or your own from longer than a few days ago.

It's in this spirit I want to rename Closures as Captors (or Captures). When you read into it and discover that the term "closure" is used in reference to "closing over variables", I submit you immediately think "what?" and then "I wonder if they mean capture a variable?".
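A small sketch of what I mean, in Python. The names `make_counter` and `increment` are hypothetical, chosen only for this example: the inner function holds on to `count` after the outer function has returned, which to my eye looks far more like capturing than closing.

```python
def make_counter():
    count = 0  # local to make_counter, yet it outlives the call

    def increment():
        nonlocal count  # refer to the captured variable, not a new local
        count += 1
        return count

    return increment


counter = make_counter()
print(counter())  # 1
print(counter())  # 2

# Each call to make_counter captures its own, separate variable.
other = make_counter()
print(other())  # 1

# CPython exposes the captured variables as "cells" on the function object.
print(counter.__closure__[0].cell_contents)  # 2
```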