CAP Theorem
- sumitnagle
- Jul 6
- 4 min read
Updated: Jul 6
The CAP Theorem (Brewer's theorem), states that a distributed database system can only guarantee two out of the following three properties, consistency, availability and partition tolerance. Ok, let first understand what these terms even mean!??
Consistency
Consistency literally means "being consistent", technically it means that every read from a distributed system always returns the most recent write (or an error if that’s not possible). In other words, it ensures that all nodes see the same data an any point in time, and after a successful write, any subsequent read from any node will return that latest written data, this is achieved by synchronising any updated data across all nodes before any read is allowed.
Based on how "much" a distributed database system provides Consistency, we can divide them into two different categories,
Strong Consistency, where the system guarantees that after a write, all reads return the latest data, and until the latest write is fully propagated, all reads are blocked. I know i know, system designer!! what you guys thinking are correct, this will have higher latency!
Eventual Consistency, guarantees that eventually, all nodes will have the latest data, and reads can return stale data until updates are propagated, means there will be a wait for synchronisation! thought low latency.
Availability
Availability for a database means that any request (read or write operation) made to the database, receives a response back, even if some nodes are down, i.e. the system is always available for response (which could be the most recent data or stale data) and never rejects a request or returns an error unless there's a total system failure.
Availability ensures that the system is always operational and responsive, even during partial failures.
Note: when we refer to Availability, we meant by availability of the whole system (constitutes of many nodes) and not just availability of individual node.
Ok! Genius! You might be realising now that these two are nemeses, if a system ensures strict Consistency, it can’t always stay available, and if it prioritises Availability, it might serve outdated data from a node that hasn’t received the latest update.
Heads up, you are thinking somewhat right! But even in the above case, database system can still act intelligent and still do best while providing both Consistency and Availability. But this is true till the whole system is healthy, meaning all nodes are working fine. However things starts getting weird, when these nodes start getting disconnected with each other, and this what we are going to discuss now,
Partition Tolerance
Partition Tolerance refers to tolerance of a distributed system to continues to operate correctly even if there is a communication breakdown (network partition) between nodes.
Network partition occurs when nodes in the system cannot communicate with each other due to failures like network issues, server crashes, or delays. If a node doesn't respond within a specific time, it's considered unreachable, this done by periodically sending heartbeat signals. If one stops responding, it may be marked as partitioned. This partition occurs when a distributed system is split into two or more disjoint groups of nodes (or partitions) that, can still communicate within their own group (internal communication is fine), but cannot communicate with nodes in the other group(s) (external communication is broken), though despite the partition, each group of nodes might continue to function independently.
For example, say we have a distributed database with 5 nodes, A, B, C, D and E and say, due to a network failure, the nodes get split into two groups, partition P1 with nodes A, B, and C and partition P2 with nodes D and E. Now within each partition, they can communicate with each other as well as with the client. i.e. nodes A, B, and C can communicate with each other, and nodes D and E can communicate with each other, however, nodes in partition P1 cannot communicate with nodes in partition P2, and vice versa. Each of these partition may continue to accept reads and writes independently. However, they lacks synchronisation. Once the network issue is resolved, the system must reconcile the data (as they can have write mismatch) and resolve conflicts, which is know as conflict resolution.
Now we are getting serious,
🌟🌟Partition tolerance is non-negotiable in any distributed systems due to the inherent possibility of network failures, and without partition tolerance, the entire system could become unavailable if any node becomes unreachable, that's why partition tolerance is a must for any distributed system. Meaning, a distributed database system must always be partition tolerant. However in presence of a network partition (and still being partition tolerant), this system must choose between,
Consistency and reject requests that cannot be guaranteed to be consistent. Or,
Availability, and accept requests but risk returning stale or inconsistent data.
This is what CAP theorem is! Now lets prove this!
Let’s assume, we have a distributed database system with two database nodes, A and B, and both nodes hold a copy of the same data. And the system claims to provide Consistency and Availability. Now imagine a network partition happens (meaning, they can no longer communicate with each other, however clients can still reach both nodes independently).
Now, say a client sends a write request to A and another client sends a read request to B. Now what should B do?
If B responds immediately (providing Availability), it may serve stale data, because it hasn't seen the latest write on A which is violating Consistency.
If B refuses to respond until it can check with A (to remain Consistency), it will be unavailable during the partition, this is violating Availability.
Therefore, during a network partition, we can’t have both Consistency and Availability. We must choose one.
Only a centralised database (single node database) can be both Consistency and Availability, as there is nothing like partition tolerance.
MongoDB favours Consistency and might deny writes if it can't confirm consistency across nodes and it will prioritise data correctness but reduce availability. Whereas, Cassandra favours Availability and system will continue to accept writes and reads on available nodes and it ensures availability but risks temporary inconsistency.
Thats it! we are done! But wait, its very important 🌟 clarification that i want to tell you! This CAP theorem tells us the trade-off between Consistency and Availability only in case when there is a network partition. And in a normal healthy operation (no partition), a distributed database can provide both Consistency and Availability, just like a centralised database. Noted right?