Let's reimplement the wheel... or at least another GCS.
What's a GCS, you ask? Good question. GCS stands for Group Communication System. Simply put, it is a system that allows separate, possibly distributed, processes to communicate with each other. How is that different from MPI or TCP/IP or XYZ? Well, the big difference is the added concept of "groups."
Group Communication Systems allow processes to join groups and communicate. Each member of the group is presented with the current membership, and when messages are sent to a group, they are delivered in a specific membership of that group. What's more, everyone participating agrees on the membership in which each message was delivered. So, when I send a message to a group and you are a member, I am not assured that you will receive it, but I am assured that I will know whether you did or did not receive it -- I (and everyone else who got the message) will know exactly who received it and we'll all agree. This is extended virtual synchrony (EVS) and it is quite powerful.
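To make "delivered in a specific membership" concrete, here is a toy, single-process sketch in Python. It is not a protocol and not the API of Spread, OpenAIS, or anything else; the names `View`, `Delivery` and `ToyGroup` are made up for illustration. It only models the bookkeeping: every delivery carries the membership (the "view") it was delivered in, and every recipient of a given message sees the same view.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class View:
    """One agreed-upon membership of a group (hypothetical type)."""
    view_id: int
    members: frozenset


@dataclass(frozen=True)
class Delivery:
    """A message together with the view it was delivered in."""
    view: View
    sender: str
    payload: str


class ToyGroup:
    """Single-process model of one group; real systems need a
    distributed membership protocol to reach the same agreement."""

    def __init__(self):
        self.view = View(0, frozenset())
        self.inboxes = {}  # member name -> list of Delivery

    def _install(self, members):
        # Every membership change installs a new, numbered view.
        self.view = View(self.view.view_id + 1, frozenset(members))

    def join(self, name):
        self.inboxes.setdefault(name, [])
        self._install(self.view.members | {name})

    def leave(self, name):
        # Also stands in for a crash or a network partition.
        self._install(self.view.members - {name})

    def send(self, sender, payload):
        # The message is delivered to exactly the members of the
        # current view, and each copy is tagged with that view.
        delivery = Delivery(self.view, sender, payload)
        for member in self.view.members:
            self.inboxes[member].append(delivery)
        return delivery


group = ToyGroup()
for name in ("a", "b", "c"):
    group.join(name)
first = group.send("a", "m1")
group.leave("c")
second = group.send("a", "m2")
print(sorted(first.view.members), sorted(second.view.members))
```

In this model "c" never sees "m2", and both "a" and "b" can tell that from the view attached to it. The sender gets no promise that "c" received the message, only agreement on who did.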
So, why does the world need this? It is used to build better, stronger technology. Too often we use systems like JMS or other durable messaging systems (which are great) to solve problems for which they are unsuited. Durable messaging systems guarantee that when I send a message to a group of people, they will all eventually get it. However, if I know that a specific user did not get the message, I can change my plans -- and if I can change my plans and compensate for a missing member, then I can avoid the exorbitant cost of that guarantee. A GCS provides just enough plumbing to develop fast, efficient and robust replication systems.
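As one illustration of "changing my plans," suppose the members of a group split some work between them. This is a minimal sketch under that assumption, and `assign_shards` is a hypothetical helper rather than part of any existing GCS. Because everyone agrees on the membership, each surviving member can recompute the same plan locally when someone drops out, with no extra coordination and no durable queue.

```python
def assign_shards(shards, members):
    """Deterministically spread shards over the agreed membership.

    Every member that calls this with the same membership gets the
    same plan, because the members are visited in sorted order.
    """
    ordered = sorted(members)
    if not ordered:
        raise ValueError("cannot assign work to an empty membership")
    plan = {member: [] for member in ordered}
    for index, shard in enumerate(shards):
        plan[ordered[index % len(ordered)]].append(shard)
    return plan


# Plan while all three members are present.
before = assign_shards(range(6), {"a", "b", "c"})
# "c" is missing from the next membership, so the survivors
# recompute the plan and absorb its share.
after = assign_shards(range(6), {"a", "b"})
print(before)
print(after)
```

Round-robin is just the simplest deterministic rule; the part that matters is that the input (the membership) is something all members already agree on.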
There are a few GCS implementations out there: OpenAIS and Spread are the two with which I am most familiar. However, they are lacking some key features. OpenAIS is licensed in a fashion that would let me contribute if I wished, but I think the project goals are not quite in line with mine.
I propose Anasazi, the next generation of highly-scalable distributed systems infrastructure. I'm looking to form a group of talented developers to build this system under a BSD-style license. The main goal (aside from function) is the suitability to run in high-uptime, mission-critical environments.
Here's what I see missing from the world of group communication systems:
- Robust configuration management and dynamic reconfiguration.
- Good, reliable client libraries that are low-overhead, non-blocking and truly MT-asyncsafe.
- Excellent error reporting and diagnostics -- distributed systems are hard to troubleshoot.
- Online internals introspection -- effectively having APIs for deep data introspection and manipulation to aid online troubleshooting.
- SNMP instrumentation
- Provide a "console" for realtime querying and manipulation.
- Leverage existing scheduling APIs for event-driven systems to ease integration directly into other systems.
- Dual-purpose the core codebase to allow for stand-alone daemon operation, integration into an event-driven application's mainloop, or integration into an application via a separate thread.
- Offer a client API that works both in-process and via IPC (in addition to sockets).
- A high-speed (optional) symmetric crypto layer and an (optional) integrity layer that is fast and simple to supplement the Secure Spread work for those who do not require strong guarantees (such as perfect forward secrecy).
- The ability to break EVS across the whole system into different sub-EVS systems. This would allow low-speed, high-latency participants to form groups that will not slow down other groups whose members are high-speed and low-latency and have stringent QoS requirements.
Any takers? (note: I'm being a tad selective)