
Buffering and Caching

  • Writer: sumitnagle
  • Jun 23

Last time we discussed how databases use storage layouts to minimise I/O latency (based on access patterns), and at the end we gave a short introduction to buffering. So today we will dig into it even more! So lets get ready!


Buffering and caching are essential mechanisms that optimise data access and improve performance, especially when dealing with disk-based storage. The reason is that we cannot always fetch data from disk (even if it is stored in order), as disk reads (even sequential ones) still have moderate latency. For this, we need storage that is orders of magnitude faster! Any guesses? 🥁🥁__drum rolls__🥁🥁 🎉 RAM 🎉.


Buffering refers to the temporary storage of data in memory (RAM) before it is written to disk or after it is read from disk, and it is managed by the Buffer Manager in the DBMS. Disk I/O is significantly slower than memory operations, so buffering reduces the number of direct disk I/O operations, thus improving performance.

When a query requests data, the DBMS first checks whether the required data is available in the buffer pool (an allocated in-memory area that holds recently accessed disk blocks, i.e. pages).

  • If the data is present (a buffer hit), it is retrieved directly from memory, which is much faster than disk access.

  • If the data is not present (a buffer miss), the DBMS loads the required data block from disk into the buffer pool.
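The hit/miss logic above can be sketched in a few lines of Python. This is a minimal illustration (class and attribute names are made up for this post, not any real DBMS's API), with a plain dict standing in for the disk:

```python
# Minimal buffer-pool lookup sketch: pages are read from "disk" only on a miss.

class BufferPool:
    def __init__(self, disk):
        self.disk = disk     # stands in for disk storage: page_id -> page bytes
        self.pages = {}      # in-memory pool: page_id -> page
        self.hits = 0
        self.misses = 0

    def get_page(self, page_id):
        if page_id in self.pages:          # buffer hit: serve from RAM
            self.hits += 1
        else:                              # buffer miss: load from disk
            self.misses += 1
            self.pages[page_id] = self.disk[page_id]
        return self.pages[page_id]

disk = {1: b"page-one", 2: b"page-two"}
pool = BufferPool(disk)
pool.get_page(1)   # miss: loaded from disk into the pool
pool.get_page(1)   # hit: served directly from memory
```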


When a query involves inserting or modifying data, the following sequence takes place,

  1. Write-Ahead Logging (WAL). Before making any changes to the actual data pages, the DBMS first records the intended changes into a transaction log (also called a write-ahead log). This ensures durability, so that in case of a system crash, the changes can be recovered using the log.

  2. Modifying in-memory (buffered) pages. Once the log is safely written, the actual data modifications happen in the buffer pool pages (in-memory). At this point, the affected pages are marked as dirty pages, meaning they have been changed in memory but the changes haven't yet been written to disk.

  3. Flushing dirty pages to disk. The DBMS doesn't immediately write these dirty pages back to disk for every change (remember bulk writes; they minimise disk I/O overhead). Instead, it follows a write-back policy, where modified pages are periodically flushed to disk, either,

    • when there’s memory pressure (maybe because the buffer is almost full),

    • during a checkpoint operation,

    • or after a certain time or number of transactions.

This entire mechanism ensures a balance between performance (through in-memory operations) and reliability (through write-ahead logging and controlled disk writes).

Note: the WAL itself is also periodically flushed to persistent storage, so that if the system crashes before a dirty page is flushed to disk, the WAL can be used to redo the operations. The recovery process scans the WAL and reapplies committed updates that weren’t flushed.

There is another cool thing called deferred writes: when the same data is updated multiple times, every update is still written to the transaction log and then to the buffer pool. However, even with multiple updates to the same page, only the final version is written to disk.
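The whole write path, WAL first, then in-memory modification, then a deferred flush that coalesces repeated updates, can be sketched roughly like this. Names such as `wal`, `update` and `checkpoint` are illustrative for this post, not any particular DBMS's internals:

```python
wal = []       # write-ahead log records, persisted before data pages
pool = {}      # buffer pool: page_id -> value
dirty = set()  # pages modified in memory but not yet written to disk
disk = {}      # stands in for persistent storage

def update(page_id, value):
    wal.append((page_id, value))   # 1. log the intended change (durability)
    pool[page_id] = value          # 2. modify the buffered page in memory
    dirty.add(page_id)             #    ...and mark it dirty

def checkpoint():
    for page_id in dirty:          # 3. flush dirty pages in bulk
        disk[page_id] = pool[page_id]
    dirty.clear()

update(7, "v1")
update(7, "v2")   # same page updated again: both records land in the WAL
checkpoint()      # deferred write: only the final version "v2" reaches disk
```

Note how both updates appear in the WAL (so recovery can redo them), but the checkpoint writes the page to disk only once.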


Now, another thing is that RAM is limited; we cannot just keep adding pages to the buffer pool, we also need to evict them. For this, DBMSs use replacement policies like Least-Recently-Used (LRU), Least-Frequently-Used (LFU), and the Clock algorithm (which sweeps pages in a circular order, approximating LRU).
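As a small illustration of the LRU policy mentioned above, here is a capacity-bounded pool built on Python's `OrderedDict` (the class name and `access` method are invented for this sketch):

```python
from collections import OrderedDict

class LRUPool:
    def __init__(self, capacity):
        self.capacity = capacity
        self.pages = OrderedDict()   # ordered from least- to most-recently used

    def access(self, page_id, load):
        if page_id in self.pages:
            self.pages.move_to_end(page_id)      # hit: mark as recently used
        else:
            if len(self.pages) >= self.capacity:
                self.pages.popitem(last=False)   # pool full: evict LRU victim
            self.pages[page_id] = load(page_id)  # miss: load the page
        return self.pages[page_id]

pool = LRUPool(capacity=2)
load = lambda pid: f"page-{pid}"
pool.access(1, load)
pool.access(2, load)
pool.access(1, load)   # touch page 1, so page 2 becomes least recently used
pool.access(3, load)   # pool is full: page 2 gets evicted
```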


Before we move on, I want to mention just another nuance for OS-geeks. When we query a database engine, it first checks whether the required data page is already in the buffer pool (RAM). If yes, it fetches the data from memory; if not, it requests the file (in which the page is present) from the operating system’s file system. The operating system then checks whether the file system cache (the OS page cache) has the data. If yes, it returns the data to the database (faster than disk I/O); if no, it loads the file from disk, causing real disk I/O.
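That layering, DBMS buffer pool in front of the OS page cache in front of the disk, can be sketched like this (all names are made up for illustration; real kernels and engines are of course far more involved):

```python
buffer_pool = {}      # DBMS-managed cache of pages
os_page_cache = {}    # kernel-managed cache of file data
disk = {"page-A": "data-A"}   # stands in for the actual disk
disk_ios = 0

def read_page(page_id):
    global disk_ios
    if page_id in buffer_pool:           # 1. DBMS buffer pool hit: fastest path
        return buffer_pool[page_id]
    if page_id not in os_page_cache:     # 2. OS page cache miss...
        disk_ios += 1                    #    ...causes a real disk I/O
        os_page_cache[page_id] = disk[page_id]
    buffer_pool[page_id] = os_page_cache[page_id]   # promote into the pool
    return buffer_pool[page_id]

read_page("page-A")   # misses both caches: exactly one disk I/O
read_page("page-A")   # buffer pool hit: no extra I/O at all
```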


Now lets move on to caching. Caching is very similar to buffering, but when we talk about caching in a DBMS, we are more interested in optimising read operations. Caching can happen at multiple levels: the disk cache (operating-system caching of disk data), the DBMS cache (the DBMS maintains its own cache for frequently accessed data, i.e. buffering), and the application cache (some applications use external caches, like Redis, to store frequently accessed data).
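To make the application-cache level concrete, here is a tiny read-through cache in front of a pretend database call. A real deployment might use Redis, but a dict with a time-to-live (the TTL value here is an arbitrary choice for the sketch) shows the idea:

```python
import time

cache = {}        # key -> (value, expiry timestamp)
TTL = 60.0        # seconds an entry stays fresh (illustrative value)
db_reads = 0

def query_db(key):
    global db_reads
    db_reads += 1                 # stands in for an expensive database round trip
    return f"row-for-{key}"

def get(key):
    entry = cache.get(key)
    if entry and entry[1] > time.time():        # fresh cache hit
        return entry[0]
    value = query_db(key)                       # miss or expired: hit the DB
    cache[key] = (value, time.time() + TTL)     # populate the cache
    return value

get("user:42")
get("user:42")    # second read is served from the cache, not the database
```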


Thats it! we are done for today! Bye Bye!
