Syringe.Net.Nz
Irregular Injection of Opinion
RSS 2.0|Atom 1.0|CDF

 Saturday, August 08, 2009
Simple Partitioning with Windows Azure Table Storage

While there are certainly situations where it makes sense to have a natural PartitionKey when working with Azure storage there are other times when all you really want is a simple way to bucket up your data into equal bins. The usual approach to partitioning is going to be some sort of hash function but if you decide to use a Guid as the RowKey for your data you’ve basically got a nice collision resistant equal distribution already, you just need to turn it into a partition key.

I was sitting on the plane back from Singapore having a bit of a think about this. Given that we can represent our Guid as a 128 bit Interger we can probably just do RowKey % PartitionCount  and get a nice simple ordinal for each partition.

So after my birthday dinner I did what any dedicated birthday boy would do and broke out Visual Studio for a bit of a hack around..

First problem was the ‘128 bit integer’ as .NET doesn’t have a native BigInt type. A quick bit of Tiwtter asking and @adjames suggested the BigInteger class in .NET 4.0, but, given this is Azure there’s no .NET 4.0 support quite yet. A bit of Binging (is that a verb yet?) found some posts on StackOverflow and an implementation of a BigInteger class on CodePlex.

A quick console application confirmed that my thinking on the plane was right.

static void Main(string[] args)
        {
            int[] counts = new int[]{0,0,0,0,0,0,0,0,0,0};
            DateTime start = DateTime.Now;
            for (int i = 0; i < 1000000; i++)
            {
                Guid g = Guid.NewGuid();
                BigInteger b = new BigInteger(g.ToByteArray());
                BigInteger c = new BigInteger(10); //Number of partitions
                int p = BigInteger.ToInt32(BigInteger.Abs(b % c));
                //Console.WriteLine(g.ToString() + " : " + p.ToString());
                counts[p] += 1;

            }
            DateTime end = DateTime.Now;
            TimeSpan duration = end - start;
            Console.WriteLine("Took: " + duration.TotalMilliseconds + " milliseconds");
            Console.WriteLine(counts.ToString(","));
            Console.ReadKey();
        }

Running this confirmed that my RowKey values would be evenly distributed across the 10 partitions- my concern here was that the Guid algorithm might not be quite up to the task but all seems good.

image

6500ms for a million rows doesn’t look too bad on the face of it. I’m sure there are plenty of performance optimizations to be eeked out, but, they’ll pale into insignificance compared to a round trip to Azure storage via the load balancer. What I do need to test is that it’s not more efficient to rehash the Guid into 64 bits and then calculate the modulo. But that’s for another night- jaded now and hoping to do 100km on the roadie in the morning.

.NET | Windows Azure|Saturday, August 08, 2009 7:50:00 AM UTC|Comments [1]|