Partitioning vs. Federation vs. Sharding

Sometimes I can be a jackass about semantics. I don't always use the right words, but I should be corrected when I choose poorly. The reason we have so many of these word things is that most have outright different meanings, and those that are synonymous have nuance that makes one more appropriate than another in a certain context.

I deal with a lot of large systems and many large systems are complicated. The more complicated things get, the more clearly they must be described and documented or you're left completely bewildered and confused. This brings me to a topic that annoys me to no end: database lingo.

Partitioning and Federation... they are similar, but different.

A partition is a structure that divides a space into two parts. Multiple partitions can break up that space into an arbitrary number of parts. In computer operating systems, this even has a more specific definition referring to the division of resources into portions. As a verb it means to divide something (typically a space) into small pieces.

A federation is a set of things (usually states or regions) that together compose a centralized unit but each individually maintains some aspect of autonomy. In computer systems this is often applied to security systems, where several autonomously operating systems, each providing security to a certain set of users or over a certain set of facilities, together provide a consistent and complete security infrastructure. In databases, it means that several databases hold information, but certain instances are completely responsible for different portions of the data, commonly based on characteristics of the data itself.

So, how are these different? It's subtle on one level, as they both describe methods of dividing datasets into smaller parts. Federation is typically across machines; federating data on a single machine is an inappropriate use of the term. Federation more often applies to schemes that divide on logical boundaries, such as the geographic sense in the definition above. The Internet is more global, so let's think of countries instead. If we were to take each country and design our systems such that all data related to each country existed on a different server, we would have a geographically federated system. Another common (and practical) example is federating based on quality of service (paying users vs. free users). The motivation behind this is clear: it makes the task of ensuring service levels on the database easier because the data set is smaller, and the logical separation lets one prioritize investment in one part of the system (e.g. more immediacy and money can be applied to ensuring availability of the servers that service paying users).
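To make the quality-of-service example concrete, here is a minimal sketch in Python of what federating on a logical boundary might look like. The host names, the "tier" field, and the server_for helper are all made up for illustration; the only point is that the owning server is chosen by a characteristic of the data itself.

```python
# Hypothetical mapping from a logical attribute (service tier) to the
# database server that is wholly responsible for that slice of the data.
# The host names are placeholders, not real systems.
FEDERATION = {
    "paying": "db-paying.example.internal",
    "free": "db-free.example.internal",
}


def server_for(user):
    """Return the server that owns this user's data, chosen by tier."""
    return FEDERATION[user["tier"]]


if __name__ == "__main__":
    users = [
        {"name": "alice", "tier": "paying"},
        {"name": "bob", "tier": "free"},
    ]
    for user in users:
        print(user["name"], "->", server_for(user))
```

Because the split follows a business boundary, it would be straightforward to spend more on the servers behind the "paying" entry without touching anything else.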

Partitioning is a more general concept and federation is a means of partitioning. Partitioning can be applied to databases at many levels. One common use is taking a single large table and splitting it into parts in order to place those parts that are accessed more frequently on faster (more expensive) storage. That partitioning scheme exists to allow use of more than one disk spindle (and even spindles of a different type or cost). However, partitioning isn't limited to a single machine. It can also be applied across multiple database instances; it is a loose term. Partitioning also does not imply a logical separation. It is often used simply to split our data up so that more hardware can be leveraged to process it. Google's information, for example, is partitioned all over the place and then they ask all the system components (servers) to participate in answering questions via their "map and reduce" system. Some partitioning schemes require mapping questions across many nodes, and some provide a priori knowledge about which components hold what data, allowing more targeted questioning.
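The difference between targeted questioning and asking every component can be sketched in a few lines of Python. This is a toy under stated assumptions: the partitions are in-memory dicts standing in for servers, the partition count of 4 is arbitrary, and hashing the key with CRC32 is just one of many possible placement schemes. Note that nothing about the split is logical; the hash simply spreads the data around.

```python
import zlib

NUM_PARTITIONS = 4  # arbitrary count chosen for illustration

# Each dict stands in for one server or storage device.
partitions = [dict() for _ in range(NUM_PARTITIONS)]


def partition_for(key):
    """Pick a partition from a stable hash of the key (no logical meaning)."""
    return zlib.crc32(key.encode("utf-8")) % NUM_PARTITIONS


def put(key, value):
    partitions[partition_for(key)][key] = value


def get(key):
    """Targeted question: we know a priori which partition holds the key."""
    return partitions[partition_for(key)].get(key)


def count_where(predicate):
    """Scatter-gather question: ask every partition, then combine the answers."""
    per_partition = [
        sum(1 for value in part.values() if predicate(value))
        for part in partitions
    ]
    return sum(per_partition)


for name, age in [("alice", 31), ("bob", 25), ("carol", 47), ("dave", 19)]:
    put(name, age)
```

A lookup by key touches exactly one partition, while a question about the values (for example, how many are over 30) has to be mapped across all of them and the partial answers reduced into one.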

The techniques for choosing which component stores a particular piece of data vary wildly, each with its own advantages and disadvantages. How you will be storing data and, more importantly, what questions you will be asking over the data set dictate the partitioning scheme that is most appropriate. Sometimes federating is right; other times a more generalized partitioning scheme is more suitable.

This brings me to my last point, and the motivation for this post.

A shard is a piece of broken ceramic, glass, rock (or some other hard material) and is often sharp and dangerous. Sharding is the act of creating shards. Somehow, somewhere, somebody decided that what they were doing was so cool that they had to make up a new term for what people have been doing for many, many years. It is partitioning... sometimes that partitioning is proper federation. You don't need a cool name to effectively accomplish what's been around for a long time. More so, you don't need a name that implies you broke something irreparably.

Copyright © 2013 - Theo Schlossnagle