Sometimes I can be a jackass about semantics. I don't always use the right words, but I should be corrected when I choose poorly. The reason we have so many of these word things is because most have outright different meanings and those that are synonymous have nuance that makes one more appropriate than another in a certain context.
I deal with a lot of large systems and many large systems are complicated. The more complicated things get, the more clearly they must be described and documented or you're left completely bewildered and confused. This brings me to a topic that annoys me to no end: database lingo.
Partitioning and Federation... they are similar, but different.
A partition is a structure that divides a space into two parts. Multiple partitions can break up that space into an arbitrary number of parts. In computer operating systems, this even has a more specific definition referring to the division of resources into portions. As a verb it means to divide something (typically a space) into small pieces.
A federation is a set of things (usually states or regions) that together compose a centralized unit but each individually maintains some aspect of autonomy. In computer systems this is often applied to security systems where several autonomously operating systems providing security to a certain set of users or over a certain set of facilities together provide a consistent and complete security infrastructure. In databases, it means that several databases hold information, but certain instances are completely responsible for different portions of the data commonly based off characteristics of the data itself.
So, how are these different? It's subtle on one level, as they both describe methods of dividing datasets into smaller parts. Federation is typically across machines; federating data on a single machine is an inappropriate use of the term. Federation more often applies to schemes that divide on logical boundaries, such as the geographic sense of the definition above. The Internet is more global, so let's think of countries instead. If we were to take each country and design our systems such that all data related to each country existed on a different server, we would have a geographically federated system. Another common (and practical) example is federating based on quality of service (paying users vs. free users). The motivation behind this is clear: it makes the task of ensuring service levels on the database easier because the data set is smaller, and the logical separation allows one to prioritize investment in one part of the system (e.g. more immediacy and money can be applied to ensuring availability of the servers that service paying users).
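As a rough sketch of the two examples above (by country and by quality of service), federation comes down to a lookup from a logical attribute of the data to the one server that owns it. The host names and function names here are made up for illustration and do not come from any real system.

```python
# Hypothetical host names: each server is fully responsible for one logical slice.
SERVERS_BY_COUNTRY = {
    "US": "db-us.example.internal",
    "DE": "db-de.example.internal",
    "JP": "db-jp.example.internal",
}

SERVERS_BY_TIER = {
    "paying": "db-paid.example.internal",
    "free": "db-free.example.internal",
}


def server_for_country(country_code: str) -> str:
    """Return the single server that owns all data for a country."""
    try:
        return SERVERS_BY_COUNTRY[country_code.upper()]
    except KeyError:
        raise LookupError(f"no federation member owns {country_code!r}") from None


def server_for_user(is_paying: bool) -> str:
    """Return the server for a user's service tier (paying vs. free)."""
    return SERVERS_BY_TIER["paying" if is_paying else "free"]
```

The point of the sketch is that the split follows a boundary that means something to the business, so the paying-user server can be given better hardware or monitoring without touching the rest.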
Partitioning is a more general concept and federation is a means of partitioning. Partitioning can be applied to databases at many levels. One common use is taking a single large table and splitting it into parts in order to place those parts that are accessed more frequently on faster (more expensive) storage. That partitioning scheme exists to allow use of more than one (and even a different type/cost of) disk spindle. However, partitioning isn't limited to a single machine. It can also be applied across multiple database instances; it is a loose term. Partitioning also does not imply a logical separation. It is often used to simply split our data up so that more hardware can be leveraged to process it. Google's information, for example, is partitioned all over the place, and then they ask all the system components (servers) to participate in answering questions via their "map and reduce" system. Some partitioning schemes require mapping questions across many nodes, and some provide a priori knowledge about which components hold what data, allowing more targeted questioning.
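The contrast between targeted questions and questions mapped across every node can be sketched in a few lines. This is only an illustration under stated assumptions: the "nodes" are in-memory dictionaries standing in for database instances, the node count is arbitrary, and the names (`node_for`, `put`, `get`, `count_where`) are hypothetical.

```python
import zlib

NODE_COUNT = 4  # assumed number of partitions

# Each "node" is an in-memory dict standing in for a database instance.
nodes = [dict() for _ in range(NODE_COUNT)]


def node_for(key: str) -> int:
    """Pick a node from the key alone (crc32 is stable across runs)."""
    return zlib.crc32(key.encode("utf-8")) % NODE_COUNT


def put(key: str, value) -> None:
    nodes[node_for(key)][key] = value


def get(key: str):
    """Targeted question: placement is known a priori, so ask one node."""
    return nodes[node_for(key)].get(key)


def count_where(predicate) -> int:
    """Mapped question: every node answers for its part, then we combine."""
    partial_counts = [
        sum(1 for value in node.values() if predicate(value)) for node in nodes
    ]
    return sum(partial_counts)
```

A lookup by key touches exactly one node, while a question about the values (for example, how many are at least 18) has to be asked of all of them and the partial answers added up.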
The techniques for choosing the component on which to store a particular piece of data vary wildly, each with its own advantages and disadvantages. How you will be storing data and, more importantly, what questions you will be asking over the data set dictate the partitioning scheme that is most appropriate. Sometimes federating is right; other times a more generalized partitioning scheme is more suitable.
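Two of the many possible placement techniques, range-based and hash-based, make the trade-off concrete. This is a minimal sketch, not a recommendation: the split points and partition count are invented, and both function names are hypothetical.

```python
import bisect
import zlib

# Hypothetical split points on the first letter: [..h), [h..p), [p..]
RANGE_BOUNDARIES = ["h", "p"]


def range_partition(key: str) -> int:
    """Keys that sort near each other land together, which suits range scans."""
    return bisect.bisect_right(RANGE_BOUNDARIES, key[:1].lower())


def hash_partition(key: str, partitions: int = 3) -> int:
    """Keys are scattered by checksum, which tends to even out load."""
    return zlib.crc32(key.encode("utf-8")) % partitions
```

With range placement, a question like "all names starting with a through g" goes to a single partition, but a popular range can overload it. With hash placement the load tends to spread out, but the same question has to be asked of every partition.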
This brings me to my last point, and the motivation for this post.
A shard is a piece of broken ceramic, glass, rock (or some other hard material) and is often sharp and dangerous. Sharding is the act of creating shards. Somehow, somewhere, somebody decided that what they were doing was so cool that they had to make up a new term for what people have been doing for many, many years. It is partitioning... sometimes that partitioning is proper federation. You don't need a cool name to effectively accomplish what's been around for a long time. More so, you don't need a name that implies you broke something irreparably.
Sunday, September 9. 2007 at 01:09
As I understand it, sharding really is different than database partitioning. With partitioning, you traditionally split up your logical data model into parts and put each of the data partitions on different servers. One logical record in a partitioned database usually has its information stored across each of the partitions.
Sharding differs from data partitioning in two ways. Firstly, each database server is identical, having the same table structure. Secondly, the data records are logically split up in a sharded database. Unlike the partitioned database, each complete data record exists in only one shard (unless there's mirroring for backup/redundancy) with all CRUD operations performed just in that database.
You may not like the terminology used, but this does represent a different way of organizing a logical database into smaller parts.
Sunday, September 9. 2007 at 11:57
The problem is that your description of multiple identical databases, each with identical table structure, is just partitioning. Partitioning does not imply that one logical record is split up. If all your servers are identical and you split your records up so that the application knows where they reside, that partitioning is called federation.
I believe that MySQL is pushing the term sharding because they decided to create a storage engine called FEDERATED and now push a different term so as not to confuse their clients.
Wednesday, January 7. 2009 at 16:07
Along the same logic, PARTITION has had a very specific meaning in some database systems (Oracle) for years. It specifically means to separate the data within a table. Also, we partition disks where we separate the data on a single disk.
You are right that "sharding" has a context of breaking irrevocably, and that is appropriate. Sometimes an application can have data that can be separated never to be joined again. If data from one account never needs to interact with data from another account then they can reside on separate databases (shards).
Federation has the implication that the pieces can be put back together again. Sharding does not.
Sunday, September 9. 2007 at 06:42
Hm, I've only heard "federated" systems applied to systems like SMTP email, where each system is responsible for its part of the world and has to communicate with other systems over a well-established protocol to interact with their part of the world.
Rather different from partitioning where normally the partitions don't interact with each other at all. The client either has to know which partition it needs to work with or there's some other mechanism for propagating data to other partitions.
The first time I heard of "sharding" was only a few days ago and I'm also unclear why there's any distinction between it and partitioning.
Monday, September 10. 2007 at 10:16
Let's not forget about "sharting" either
Monday, September 10. 2007 at 15:05
Actually, a piece of broken ceramic is called a "sherd." At least to archaeologists. ;-)
—Theory
Thursday, September 27. 2007 at 20:50
(I may be responsible for introducing this term.) A shard of data is a term used to describe the similar properties of both partitioning and federation when the data layer cannot be classified solely as either, especially in the case when the data is converted from a centralized location to fragments of data across the entire pool.
For instance, you have a traditional one-master, many-slave topology. Classifying certain fragments of data to exist on certain servers and removing the centralized location for reads and writes is the process of sharding the data, since this is both partitioning and federating data.
Saturday, September 29. 2007 at 17:04
Thanks for making my point. What you describe is partitioning. That someone felt a need to introduce a new term for something that is well described is the heart of the frustration that motivated the post.
Thursday, October 11. 2007 at 06:07
Sadly, there isn't enough volume for Google Trends, but still, as of 10/11/2007:
1,850,000 for partitioning database.
374,000 for sharding database.
No doubt it will be successful!
It is funny, because I am writing a presentation about J2EE vs. PHP vs. Ruby, showing that it is more about culture than technology. I wanted to mention "sharding" as an innovation for scalability that doesn't depend on PHP or Java or any specific database.
Now the point goes beyond that: it illustrates the buzzword factory that web development is. If you are from a traditional database culture, you're doing horizontal partitioning; if you are from a Web culture, it is "sharding".
A whole lot more sexy, hyped and innovative! Well, it introduces some confusion, but that is the price of success:
- http://www.hibernate.org/414.html
- http://www.computerworld.com/action/article.do?command=viewArticleBasic&articleId=9017778&pageNumber=2
I quote: "Sharding usually involves divvying up data onto different physical machines. Partitioning, in contrast, typically occurs on the same piece of hardware" :-)
- http://www.datacenterknowledge.com/archives/2007/Apr/27/database_sharding_helps_high-traffic_sites.html
Anyway, thanks for your information and don't let the Shards bring you too much frustration.
Thursday, June 5. 2008 at 09:26
Special thanks for that last sentence, Theo :-)
Tuesday, January 13. 2009 at 04:01
The difference between partitioning and sharding is that sharding applies specifically to the technique of horizontal partitioning, whereas partitioning itself could be either horizontal or vertical. The term sharding is slightly more specific.
The tech industry is full of nomenclature like this. It's important that we define it as doing so helps us to communicate better, even if we just decide that two terms are in fact the same.
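For readers who want the horizontal/vertical distinction made concrete, here is a small sketch. The table contents and the function names are hypothetical, and a list of dictionaries stands in for a real table.

```python
# A tiny stand-in "table"; contents are invented for illustration.
ROWS = [
    {"id": 1, "name": "Ada", "email": "ada@example.com"},
    {"id": 2, "name": "Bob", "email": "bob@example.com"},
    {"id": 3, "name": "Cy", "email": "cy@example.com"},
    {"id": 4, "name": "Di", "email": "di@example.com"},
]


def horizontal_partition(rows, key, parts):
    """Split by row: every piece has all columns, each row lives in one piece."""
    pieces = [[] for _ in range(parts)]
    for row in rows:
        pieces[row[key] % parts].append(row)
    return pieces


def vertical_partition(rows, key, column_groups):
    """Split by column: every piece has all rows, but only some columns."""
    return [
        [{column: row[column] for column in (key, *group)} for row in rows]
        for group in column_groups
    ]
```

In the horizontal case a complete record is found in exactly one piece; in the vertical case a complete record has to be reassembled from every piece by its key.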
Sunday, May 24. 2009 at 03:05
It is not advantageous to add a term when we already have a term that has been used historically to describe the exact same thing. No one is saying sharding is just partitioning. Sharding is horizontal partitioning. Period. So the term sharding is not slightly more specific. In fact, it is less descriptive than horizontal partitioning. Only now we have two terms being used for the same thing, and that adds confusion, not clarification.
Friday, January 14. 2011 at 12:25
I like this post and I agree with you that the components of federated systems are autonomous. Autonomy does seem to be a key part of the difference. But I think a federated system can have overlapping regions and so isn't necessarily a partitioned system. Also, I think there's a slight difference between partitioning and sharding. Sharding (it seems to me) implies that the pieces play the same role or are homogenous.