Durability Mechanisms
- sumitnagle
- Jun 28
- 8 min read
So we have discussed a lot of things around data, buffering, indexing, transactions, and concurrency control, but none of it makes any sense if the database cannot store the data durably! Imagine winning a $1M lottery, dreaming of your McLaren all night, and waking up to find $0 in your account because your bank’s developer forgot to ensure durability in their database.
Durability ensures that once a transaction is committed, its changes are permanent, even in the event of system failures, power loss, or crashes. Ok! Let's jump directly into the different techniques that are used to ensure durability!
Write Ahead Logging
I think you must have heard this term a lot of times by now! So now let's actually jump into it!
Write-ahead logging, at its core, logs changes before applying them to the actual database. This ensures that if a system failure occurs, the database can recover to a consistent state by replaying the log.
First and foremost, write-ahead logging is relevant only for write operations (INSERT, UPDATE, DELETE, and so on), because WAL’s job is to record changes to the database before they’re written to disk, ensuring durability and crash recovery. For read operations, the database only needs the buffer pool (if the page is present in memory) or, failing that, the data files on disk.
The WAL mechanism involves multiple components,
Log buffer, an in-memory buffer where log records are stored temporarily and sequentially (append-only) before being flushed to disk.
Data buffer (this is what we refer to as the cache or buffer pool), an in-memory buffer where the actual data changes are stored temporarily before being flushed to disk.
WAL log files, persistent files on disk that contain the log records. Yeah! Even the log files are persisted on disk (durability, remember!).
Data files, the actual database files on disk where data records are stored.
The WAL records in the log buffer are flushed (persisted to disk) at several points,
On every COMMIT. When an operation commits, its WAL records (including the COMMIT record) are flushed from the log buffer to disk immediately to guarantee durability.
Periodically. In the background, databases may also flush the WAL to disk periodically (for example, via checkpoints or background writer processes) to limit recovery time, by flushing all dirty WAL and data pages up to a certain point.
Critical operations. Some critical operations (like CREATE DATABASE, DROP TABLE, and so on) force an immediate WAL flush too.
Log buffer gets full. When the in-memory WAL buffer reaches its size limit, the database flushes the WAL to disk.
Now, putting it together, any write operation goes through these steps:
1. The operation is issued.
2. A log record describing the change is immediately appended to the log buffer.
3. The change is applied to the data page in the buffer pool (in memory).
4. On a flush (triggered by any of the points above), the log buffer is written to the WAL log file on disk.
5. Later, the buffer pool's dirty data pages are written to the data files on disk (for example, at a checkpoint or on eviction).
The logging into the log buffer is sequential (append-only), which also makes it faster than random writes.
The actual data changes from the buffer pool are written to the data files lazily (by a background process or during checkpoints), i.e. periodically the DBMS creates a checkpoint and all dirty pages (modified pages not yet written to disk) in the buffer pool are flushed to the data files. This reduces recovery time.
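To make this flow concrete, here is a minimal Python sketch of the write path described above. All the names (WAL, append, flush, update, commit) are made up for illustration and are not any particular database's API; it just mirrors steps 2-5, assuming the WAL is flushed on every COMMIT and whenever the log buffer fills up.

import os

class WAL:
    def __init__(self, path, buffer_limit=100):
        self.log_buffer = []                      # in-memory, append-only log records
        self.buffer_limit = buffer_limit
        self.wal_file = open(path, "a")           # WAL log file on disk

    def append(self, record):
        self.log_buffer.append(record)            # step 2: log the change first
        if len(self.log_buffer) >= self.buffer_limit:
            self.flush()                          # flush when the log buffer is full

    def flush(self):
        for record in self.log_buffer:
            self.wal_file.write(record + "\n")
        self.wal_file.flush()
        os.fsync(self.wal_file.fileno())          # step 4: force the log to disk
        self.log_buffer.clear()

buffer_pool = {}                                  # page_id -> in-memory page contents
wal = WAL("wal.log")

def update(txid, page_id, key, old, new):
    wal.append(f"{txid} UPDATE page={page_id} {key}: {old} -> {new}")
    buffer_pool.setdefault(page_id, {})[key] = new   # step 3: dirty page in memory

def commit(txid):
    wal.append(f"{txid} COMMIT")
    wal.flush()                                   # every COMMIT forces a WAL flush
    # step 5 happens later: dirty pages reach the data files at a checkpoint/eviction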
Now it's obvious that a crash could happen at any step of this whole process, for example,
Before the WAL log record is created in memory (before step 2). In this case, the data is nowhere, not even in memory, so nothing needs to be done and it is safe. However, the application layer should handle the failed operation.
After the WAL record is in memory (step 2), but not yet in the buffer pool (step 3). The data exists only in the log buffer, which is lost on restart, so nothing needs to be done as there is no data in the buffer pool or the data files.
After both are in memory (before step 4). The data exists only in memory, which is lost on restart.
After the WAL is flushed to disk (step 4), but the data is still only in the buffer pool. On boot-up, the database replays the WAL and performs its undo/redo operations.
After the data from the buffer pool is flushed to the data files (step 5). Everything is persisted and the system boots up from a valid state.
So recovery is mostly required when the system crashes after the WAL is flushed but before the data from the buffer pool reaches the data files (i.e. between steps 4 and 5), and that recovery is done via the WAL log records. So that's what WAL is for! We will now see how recovery happens!
Recovery
We discussed multiple points at which WAL log records are persisted, and we can clearly see that these persisted log files can contain both COMMITTED (changes which should be persisted) and UNCOMMITTED (changes which should be rolled back) data.
Now, how do these operations even make their changes COMMITTED? For that, there is a subtle difference between normal operations and transactional operations,
Both of these types of operations write log records of their changes; however, in a transaction the changes stay uncommitted until a COMMIT log record is written, whereas normal (standalone) operations are auto-committed. For example, for this normal operation,
UPDATE accounts SET balance = balance - 500 WHERE id = 1;
There are actually two log records written: an UPDATE log record (written immediately), followed by a COMMIT log record (written immediately after, because it's auto-commit).
Now consider a transaction,
BEGIN;
UPDATE accounts SET balance = balance - 500 WHERE id = 1;
-- (maybe other operations)
COMMIT;
In this case, the UPDATE log record is written immediately; however, the COMMIT log record is recorded only when the COMMIT command is issued, and until then the change remains UNCOMMITTED.
Did you notice one thing! Normal operations are always committed, whereas a transaction remains uncommitted until the COMMIT; command is invoked.
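To visualize the difference, here is a rough sketch of what the WAL conceptually contains in the two cases (the field names are made up for illustration; they roughly mirror the record format shown further below):

# WAL after the standalone (auto-committed) UPDATE
wal = [
    {"lsn": 101, "txid": 88, "op": "UPDATE", "table": "accounts", "before": 1000, "after": 500},
    {"lsn": 102, "txid": 88, "op": "COMMIT"},   # written right away, so txid 88 is COMMITTED
]

# WAL after BEGIN + UPDATE, before COMMIT is issued
wal = [
    {"lsn": 201, "txid": 99, "op": "UPDATE", "table": "accounts", "before": 1000, "after": 500},
    # no COMMIT record yet, so txid 99 is still UNCOMMITTED; if the system
    # crashes now, recovery will undo this change
]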
Ok, we know what COMMITTED and UNCOMMITTED data is, but how does it affect recovery?
For COMMITTED. On recovery, the Redo operation needs to be done, meaning re-apply all the changes, including updating the buffer pool (with the committed changes).
Redo phase, ensures that all committed changes are applied to the database. The system reprocesses all log records from the last checkpoint. Even if a record seems to have been applied, the recovery algorithm confirms that each change is reflected on disk. This phase is idempotent, meaning applying a redo more than once does not adversely affect the system.
For UNCOMMITTED. On recovery, the Undo operation needs to be done, meaning removing all the changes that were made by transactions that never committed.
Undo phase, rolls back the effects of uncommitted transactions. Using the transaction table, the database walks backward through the WAL, undoing the actions of transactions that were active at the time of the crash.
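Here is a minimal Python sketch of this simplified redo/undo pass, assuming each log record is a dict carrying a transaction id, the affected page, and before/after images (similar to the record format shown a bit further below); real recovery algorithms such as ARIES are considerably more involved.

def recover(wal_records, data_pages):
    # Analysis: find every transaction that has a COMMIT record in the WAL
    committed = {r["txid"] for r in wal_records if r["op"] == "COMMIT"}

    # Redo phase: re-apply the changes of committed transactions in log order.
    # This is idempotent: applying an after_image twice gives the same result.
    for r in wal_records:
        if r["op"] == "UPDATE" and r["txid"] in committed:
            data_pages[r["page_id"]].update(r["after_image"])

    # Undo phase: walk the log backwards and restore the before_image of every
    # change made by a transaction that was still active (never committed).
    for r in reversed(wal_records):
        if r["op"] == "UPDATE" and r["txid"] not in committed:
            data_pages[r["page_id"]].update(r["before_image"])

    return data_pages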
This is the reason a database considers a transaction to be committed the moment its log records are flushed to the log file! Because we know the database can recover everything (via the redo operation) even in case of a system failure.
And how does the WAL do that? In essence, each log record contains enough detail to redo or undo the operation after a system failure. It may contain information such as,
{
"lsn": "0/16B6A00", (A unique, increasing number marking the position of this record in the WAL)
"transaction_id": 45321, (Identifying which transaction made the change)
"operation": "UPDATE", (type of operation)
"relation": "public.customers", (table)
"page_id": "23", (page identifier)
"tuple_id": "42",
"before_image": { (old value before change)
"balance": 1000
},
"after_image": { (new value after change)
"balance": 1200
},
"timestamp": "2025-06-27T21:35:00Z"
}
Checkpointing
Checkpointing is a technique used to reduce recovery time by truncating the WAL and establishing a known good state. In short, we discussed earlier that dirty pages in the buffer pool are lazily flushed to the data files; once that happens the database has moved to a new consistent state on disk, meaning we can simply truncate all the old WAL log records, as they are no longer required.
At periodic intervals, the database system writes a checkpoint record to disk. This operation forces the database engine to flush all dirty pages (pages in memory that have been modified) to persistent storage and record a summary of the current consistent state.
Recovery (as we discussed above) only needs to process log records generated after the last checkpoint, significantly shortening the recovery time. It captures a snapshot of the transaction table and dirty page table, which provides a clear starting point for both redo and undo phases during recovery.
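A simplified sketch of a checkpoint, continuing the illustrative Python from earlier (it treats the WAL as a plain list of records and ignores transactions that are still in flight):

def checkpoint(buffer_pool, data_file, wal_records):
    # 1. Flush every dirty page from the buffer pool to the data file on disk
    for page_id, page in buffer_pool.items():
        data_file[page_id] = dict(page)

    # 2. Everything logged so far is now reflected in the data file, so the
    #    old log records are no longer needed for crash recovery
    wal_records.clear()

    # 3. Write a checkpoint record marking the new known-good state;
    #    recovery only has to replay records that come after it
    wal_records.append({"op": "CHECKPOINT"})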
Checkpointing and Redo/Undo operations are both integral to database recovery, but they serve distinct roles rather than being the same mechanism or merely a “clubbed together” functionality. They interoperate within the overall recovery process to ensure database consistency and minimise downtime after a crash, yet each has its own specific purpose and implementation details.
Before we finish this blog, there is a special mention,
Shadow Paging
Shadow Paging is a durability mechanism used in databases to ensure atomicity and consistency without the need for complex logging systems like write-ahead logging. It ensures that any modifications to the database are atomic, either fully applied or not applied at all.
Shadow paging is nothing but an idea similar to copy-on-write, where data is never modified in place. Instead, when data needs to be modified, a copy of the original data is made, the modifications are applied to the copy, and only after the modifications are successfully completed is the system atomically switched to reference the new version of the data. The database maintains two sets of pages,
Data pages, reflecting the current state of the database. These data pages initially reside on disk. When a transaction begins, the required data pages are loaded into memory (into the buffer pool) for processing.
Shadow pages, copies of the original data pages used for modification during a transaction. They are used during transactions to safely apply modifications without affecting the original data.
When a transaction modifies a page, a copy of the data page (called a shadow page) is created in memory. Modifications are applied to this in-memory copy (the shadow page), not the original page. Also, the original data page on disk remains untouched until the transaction is committed.
The database also maintains multiple page tables in memory: a shadow page table (S-Table), or root page table, containing pointers to the committed data pages (in memory), and a current page table (C-Table), which at the beginning is a copy of the S-Table and later points to the modified shadow pages.
Upon a successful commit, the DBMS atomically switches the pointers of the root table to the new modified pages (i.e. the shadow pages), which are then later flushed to disk. If a failure occurs, the system falls back to the original, earlier committed pages. For example, say the page tables currently point to,
S-Table (root table) -> [Page1, Page2, Page3]
C-Table -> [Page1, Page2, Page3]
When a new transaction starts, the DBMS creates copies of the pages to be modified.
S-Table -> [Page1, Page2, Page3]
C-Table -> [Page1, Page2 (exact copy), Page3]
Now the modifications are made to the copied pages in the C-Table, while the S-Table remains unchanged.
S-Table -> [Page1, Page2, Page3]
C-Table -> [Page1, Page2 (Modified Copy), Page3]
Once all modifications are successful, the pointer in the root page table is atomically switched to point to the new (modified) pages, i.e. the root page table is replaced by the new current page table. This ensures the operation is atomic.
S-Table -> [Page1, Page2 (Modified), Page3]
If a crash occurs before committing, the system still points to the old, previously committed pages, and the uncommitted changes (in the current page table) are discarded. If a crash occurs after committing, the root page table’s pointer already points to the new, committed pages and the system starts from the newly committed state.
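Here is a toy Python sketch of that walkthrough. The page tables are plain dicts and the "atomic switch" is just a variable assignment; a real system would do this with on-disk page tables and a single atomic update of the root pointer.

# committed state: the root (shadow) page table points at the committed pages
s_table = {"Page1": {"balance": 1000}, "Page2": {"rows": [1, 2]}, "Page3": {}}

def begin_transaction(root_table):
    # the current page table starts as a copy of the root table,
    # still pointing at the same, unmodified pages
    return dict(root_table)

def write(c_table, page_id, changes):
    # copy-on-write: modify a copy of the page, never the original;
    # the page referenced by the root table stays untouched
    page_copy = dict(c_table[page_id])
    page_copy.update(changes)
    c_table[page_id] = page_copy

def commit(c_table):
    # commit is a single atomic switch of the root pointer
    global s_table
    s_table = c_table

c_table = begin_transaction(s_table)
write(c_table, "Page2", {"rows": [1, 2, 3]})
commit(c_table)   # a crash before this line leaves s_table at the old committed state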
Copying entire pages for every write is inefficient for large databases and requires double the storage during updates because of the page copies. It also makes it hard to manage concurrent transactions efficiently, which makes shadow paging a poor fit for write-heavy systems. Also, frequent copying can lead to disk fragmentation over time.