Simple Partitioning with Windows Azure Table Storage
While there are certainly situations where it makes sense to have a natural PartitionKey when working with Azure storage there are other times when all you really want is a simple way to bucket up your data into equal bins. The usual approach to partitioning is going to be some sort of hash function but if you decide to use a Guid as the RowKey for your data you’ve basically got a nice collision resistant equal distribution already, you just need to turn it into a partition key. I was sitting on the plane back from Singapore having a bit of a think about this. Given that we can represent our Guid as a 128 bit Interger we can probably just do RowKey % PartitionCount and get a nice simple ordinal for each partition. So after my birthday dinner I did what any dedicated birthday boy would do and broke out Visual Studio for a bit of a hack around.. First problem was the ‘128 bit integer’ as .NET doesn’t have a native BigInt type. A quick bit of Tiwtter asking and @adjames suggested the BigInteger class in .NET 4.0, but, given this is Azure there’s no .NET 4.0 support quite yet. A bit of Binging (is that a verb yet?) found some posts on StackOverflow and an implementation of a BigInteger class on CodePlex. A quick console application confirmed that my thinking on the plane was right. static void Main(string[] args)
{
int[] counts = new int[]{0,0,0,0,0,0,0,0,0,0};
DateTime start = DateTime.Now;
for (int i = 0; i < 1000000; i++)
{
Guid g = Guid.NewGuid();
BigInteger b = new BigInteger(g.ToByteArray());
BigInteger c = new BigInteger(10); //Number of partitions
int p = BigInteger.ToInt32(BigInteger.Abs(b % c));
//Console.WriteLine(g.ToString() + " : " + p.ToString());
counts[p] += 1;
}
DateTime end = DateTime.Now;
TimeSpan duration = end - start;
Console.WriteLine("Took: " + duration.TotalMilliseconds + " milliseconds");
Console.WriteLine(counts.ToString(","));
Console.ReadKey();
}
Running this confirmed that my RowKey values would be evenly distributed across the 10 partitions- my concern here was that the Guid algorithm might not be quite up to the task but all seems good.
6500ms for a million rows doesn’t look too bad on the face of it. I’m sure there are plenty of performance optimizations to be eeked out, but, they’ll pale into insignificance compared to a round trip to Azure storage via the load balancer. What I do need to test is that it’s not more efficient to rehash the Guid into 64 bits and then calculate the modulo. But that’s for another night- jaded now and hoping to do 100km on the roadie in the morning.
.NET | Windows Azure|Saturday, August 08, 2009 7:50:00 AM UTC||
|