Disk I/O in Databases
- sumitnagle
- Jun 22
- 8 min read
Updated: Jun 23
Disk I/O operations (input/output operations) involve fetching data from disk storage (persistent storage) into memory and writing data back to disk storage.
These operations are slow 🐢 compared to CPU and memory operations (though, unlike disk, memory is not persistent) due to mechanical movement (in HDDs) or access latency (in SSDs). Since disk I/O is orders of magnitude slower than RAM access, excessive disk I/O can significantly slow down database performance.
That’s why when your favourite website 😏 takes forever to load and you feel like smashing your keyboard, it’s because somewhere, a poor disk is scrambling to fetch your data at 🐢 speed!
Some people might argue that SSDs are faster than HDDs; however, even with SSDs, the latency and throughput of disk access remain a major bottleneck in high-performance database systems.
Disk Latency Breakdown
On an HDD, every access pays for seek time (moving the head to the right track), rotational latency (waiting for the sector to spin under the head), and transfer time (actually reading the bytes). SSDs eliminate seek and rotational latency but still have transfer time and controller overhead.
Another important thing to emphasise: operations requiring sequential disk reads/writes (sequential I/O) are much faster than random access (random I/O), because modern storage devices are optimised for contiguous data access. Accessing data from different locations on the disk (random I/O) requires frequent head movement (HDD) or additional controller operations (SSD). If data is read/written in contiguous blocks (sequential I/O), the operation is much faster (often a 10x–100x improvement over random I/O).
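To get a feel for the gap, here is a rough, self-contained C sketch (not a rigorous benchmark; the OS page cache can easily hide the difference for small or recently-read files) that reads the same number of 4KB blocks first sequentially and then at random offsets. The file name data.bin, the block size, and the read count are arbitrary assumptions.

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define BLOCK 4096  // read in 4KB chunks, roughly one database page

int main(void) {
    FILE *f = fopen("data.bin", "rb");               // hypothetical large data file
    if (!f) return 1;

    fseek(f, 0, SEEK_END);
    long blocks = ftell(f) / BLOCK;                  // whole blocks in the file
    if (blocks == 0) { fclose(f); return 1; }
    long reads = blocks < 10000 ? blocks : 10000;    // cap the number of reads
    char buf[BLOCK];

    clock_t t0 = clock();
    fseek(f, 0, SEEK_SET);
    for (long i = 0; i < reads; i++)                 // sequential: one block after another
        fread(buf, 1, BLOCK, f);
    double seq = (double)(clock() - t0) / CLOCKS_PER_SEC;

    t0 = clock();
    for (long i = 0; i < reads; i++) {               // random: jump to an arbitrary block
        fseek(f, (rand() % blocks) * (long)BLOCK, SEEK_SET);
        fread(buf, 1, BLOCK, f);
    }
    double rnd = (double)(clock() - t0) / CLOCKS_PER_SEC;

    printf("sequential: %.3fs, random: %.3fs\n", seq, rnd);
    fclose(f);
    return 0;
}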
Now that we know what the problem is, let's see how databases are engineered to minimise disk I/O and make it as efficient as possible. This requires a combination of strategies: efficient storage layouts, indexing strategies, caching mechanisms, compression techniques, and distributed architectures.
Wait wait! In this blog, we will only discuss the storage layout part, and we can discuss the other optimisations in some later blog (I don't want to make this too long, I know, you guys will definitely sleep 😴).
Before we talk about storage layout, it’s important to understand how storage works in a database.
A database engine is responsible for efficiently storing, managing, and retrieving data. One of the fundamental mechanisms it uses is page-based storage. Here, data is stored in pages (also called blocks), the smallest unit of storage, typically 4KB to 16KB in size. Pages are grouped into larger allocation units (the exact scheme and sizes are database-specific), and these pages are stored in files (data files managed by the database).
Alongside data pages, there are separate files for indexes, metadata, and transaction logs. This structure allows efficient storage, retrieval, and management of data on disk.
A page is itself a data structure used by databases to store and manage data efficiently. It has a well-defined structure that includes metadata, actual data, and sometimes indexing information. A page contains multiple rows or columns of data (this is where the storage layout comes in, more about it later), along with metadata and pointers for navigation.
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define PAGE_SIZE   4096   // 4KB fixed page size
#define MAX_RECORDS 100    // maximum records per page
// space reserved for metadata: the three int fields plus the offset array
#define HEADER_SIZE (3 * sizeof(int) + MAX_RECORDS * sizeof(int))

typedef struct {
    int  page_id;                       // page identifier
    int  record_count;                  // number of records currently in the page
    int  free_space_offset;             // offset of the next free byte in `data`
    int  record_offsets[MAX_RECORDS];   // offsets (pointers) to rows inside `data`;
                                        // enables fast access and logical ordering
    char data[PAGE_SIZE - HEADER_SIZE]; // stores the actual data records
} Page;
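To make the role of these fields concrete, here is a minimal sketch (illustrative, not any real engine's code) of how a record might be appended to the Page defined above: the record's bytes are copied into the free area of data, and its starting position is remembered in record_offsets.

// Append a record to a page; returns 0 on success, -1 if the page is full.
int insert_record(Page *page, const char *record, int length) {
    if (page->record_count >= MAX_RECORDS ||
        page->free_space_offset + length > (int)sizeof(page->data)) {
        return -1;  // no room left: the engine would allocate a new page
    }
    memcpy(page->data + page->free_space_offset, record, length);
    page->record_offsets[page->record_count++] = page->free_space_offset;
    page->free_space_offset += length;
    return 0;
}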
Different databases store data in different physical file formats. For example, MySQL (InnoDB engine) stores data in .ibd files (InnoDB tablespaces). PostgreSQL stores each table as one or more files under the base/ directory (named after the table's filenode number). MongoDB stores BSON-encoded documents inside .wt (WiredTiger) files.
Internally, virtually all modern relational databases, and many non-relational ones, store and manage data using pages, even though the physical file formats differ across databases.
Databases read and write in fixed-size pages (typically 4KB–16KB). Larger pages reduce seek overhead but can waste memory, whereas smaller pages reduce the memory footprint but increase disk fragmentation. So workloads dominated by many small writes tend to favour smaller page sizes, whereas workloads doing large amounts of reading benefit from larger pages.
Wait wait, I know we are moving fast, but there are a couple of things I need to point out (for those OS-geeks)!! First, a DBMS engine runs on top of an operating system and relies on it for memory management, disk I/O, process scheduling, and other system-level operations. In most cases, DBMS engines do not interact directly with the disk (some databases do use direct disk access techniques to optimise performance). Instead, they use the operating system's file system to read and write data, i.e. DBMS data files (inside which the database's pages reside) are stored in the filesystem managed by the operating system.
The DBMS stores data as files in the operating system (for example, .db, .frm, .ibd, .ndb files). It uses OS system calls (like read(), write(), mmap()) to access these files, and the operating system handles the actual disk I/O, caching, and memory mapping.
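As a concrete illustration, here is a minimal sketch using plain POSIX calls (not tied to any particular DBMS); the file name employees.ibd and the page-numbering scheme are illustrative assumptions. The engine asks for a page by number, and the operating system performs the actual disk read.

#include <fcntl.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

#define PAGE_SIZE 4096

// Read page number `page_id` of a data file into `buffer` (PAGE_SIZE bytes).
int read_page(const char *path, int page_id, char *buffer) {
    int fd = open(path, O_RDONLY);                    // data file lives in the OS filesystem
    if (fd < 0) return -1;
    off_t offset = (off_t)page_id * PAGE_SIZE;        // pages are addressed by number
    ssize_t n = pread(fd, buffer, PAGE_SIZE, offset); // the OS does the actual disk I/O
    close(fd);
    return n == PAGE_SIZE ? 0 : -1;
}

int main(void) {
    char page[PAGE_SIZE];
    if (read_page("employees.ibd", 0, page) == 0)
        printf("loaded page 0 into memory\n");
    return 0;
}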
Second, a page in a database management system (DBMS) is different from a page in an operating system (OS), although they share some conceptual similarities. A page in an operating system is a fixed-size block of memory used in virtual memory management; it is the unit of data transfer between RAM and disk storage. A page in a DBMS, on the other hand, is a logical concept and represents the smallest unit on which storage and retrieval are performed. It is used for efficient data organisation and indexing, and these pages store records, indexes, and metadata.
Now that we know how data storage is done on disk using pages, comes the cool part: how do we arrange the data within a page itself?
Efficient Storage Layout
Databases use different storage layouts to optimise for different workloads. Data is organised in two main orientations: row-oriented and column-oriented. Each of these layouts has advantages and trade-offs depending on the type of queries, data access patterns, and performance requirements.
Question! Why do we want to optimise the storage layout? Since disk I/O is one of the major bottlenecks in database performance (discussed above), designing a storage layout that minimises read/write operations is essential. Optimising the storage layout helps reduce fragmentation (ensuring sequential access), reduce random disk seeks, improve cache locality, and enhance overall database performance.
Row-Oriented Storage
In this layout, entire rows of data are stored together in contiguous blocks. Each row consists of multiple columns, and the data is stored sequentially. For example, say we have multiple rows of data,

Employee_ID  Name     Age  Department  Salary
101          Alice    30   HR          50000
102          Bob      28   IT          70000
103          Charlie  35   Finance     60000
In a row-oriented database, this data will be stored in sequence (row by row),
101, Alice, 30, HR, 50000
102, Bob, 28, IT, 70000
103, Charlie, 35, Finance, 60000
When stored, the data is written as binary inside a disk page. If a row is too large (larger than the page size), it may span multiple pages.
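Here is a minimal sketch (the field layout is illustrative, not any real engine's on-disk format) of the idea: each employee row is packed field after field, and rows follow one another contiguously inside the page's data area.

#include <stdio.h>
#include <string.h>

#define PAGE_SIZE 4096

typedef struct {
    int  employee_id;
    char name[16];
    int  age;
    char department[16];
    int  salary;
} EmployeeRow;

int main(void) {
    char page[PAGE_SIZE];  // the page's data area
    int  used = 0;

    EmployeeRow rows[] = {
        {101, "Alice",   30, "HR",      50000},
        {102, "Bob",     28, "IT",      70000},
        {103, "Charlie", 35, "Finance", 60000},
    };

    // rows are appended one after another: all columns of a row stay together
    for (int i = 0; i < 3; i++) {
        memcpy(page + used, &rows[i], sizeof(EmployeeRow));
        used += sizeof(EmployeeRow);
    }
    printf("3 rows packed into %d bytes of the page\n", used);
    return 0;
}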
When we query,
SELECT * FROM employees WHERE Employee_ID = 102;
the database loads the entire page containing that row into memory.
🌟 Row-oriented storage is optimised for transactional workloads (OLTP), where frequent reads and writes involve entire records (i.e., all columns of a row). This layout enables efficient inserts, updates, and retrievals when accessing full rows, as the data is stored contiguously on disk.
However, this kind of storage layout is inefficient for analytical workloads (OLAP) that require selecting a subset of columns from large datasets. The inefficiency arises because when querying only specific columns (for example, fetching only the "Name" column from a large dataset), the database must scan entire rows, locate the relevant column values within each row, and discard unnecessary data. Since the entire row must be read and parsed—even for a single-column query—this results in higher I/O overhead and reduced query performance.
In row-oriented databases, the smallest chunk in which data is stored is called a tuple (row). A tuple is a logical unit that represents a record in a database table. Multiple tuples are stored within a page, i.e. internally these rows are organised into pages, which are the smallest units of storage on disk. It is similar for column-oriented databases (discussed later), where instead of rows, the values of a column (or a group of columns, sometimes called a column family) are organised into pages.
From this structure we can intuit that this layout is good for queries involving the whole row,
SELECT * FROM employees WHERE Employee_ID = 102;
and inefficient for queries involving a subset of columns, such as,
SELECT AVG(Salary) FROM employees;
So does that mean we can never optimise queries that touch only a subset of columns? No, the answer is column-oriented storage.
Column-Oriented Storage
In this layout, data is stored column-wise instead of row-wise: each column's values are stored together. Instead of storing rows sequentially, columnar storage organises the same data (from the example above) as,
101, 102, 103
Alice, Bob, Charlie
30, 28, 35
HR, IT, Finance
50000, 70000, 60000
Each column gets its own pages, i.e. there is no mixing of columns within a page. So, if we query,
SELECT AVG(Salary) FROM employees;
only the pages holding the Salary column are loaded. However, if we insert a new row, multiple column pages must be updated separately, making writes slower.
Because data is stored column by column, sequential access is efficient for analytical queries (i.e. ideal for analytical workloads, where operations involve aggregations over a few columns).
Queries like,
SELECT AVG(Salary) FROM employees;
are faster because the Salary values are stored together; the DBMS only needs to load the Salary column, which is laid out sequentially on disk.
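A minimal sketch of why this layout helps here: all the Salary values sit together in one contiguous array (one set of pages), so computing the average is a single sequential scan that never touches the other columns. The array below simply mirrors the example data.

#include <stdio.h>

int main(void) {
    // the Salary column from the example, stored on its own
    int salaries[] = {50000, 70000, 60000};
    int n = sizeof(salaries) / sizeof(salaries[0]);

    long long sum = 0;
    for (int i = 0; i < n; i++)  // sequential scan over a single column
        sum += salaries[i];

    printf("AVG(Salary) = %.2f\n", (double)sum / n);
    return 0;
}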
Columnar databases struggle with frequent updates/insertions and joins because related data is scattered across multiple files. Since each column is stored separately, inserting a full row, such as,
INSERT INTO employees VALUES (104, 'David', 32, 'HR', 55000);
is inefficient, as writing a new row requires updating multiple column files, which can result in higher write amplification (i.e., multiple storage locations need to be updated for a single insert).
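Here is a minimal sketch of that write amplification (illustrative arrays standing in for per-column pages/files): one logical insert has to append to every column's storage.

#include <stdio.h>

#define MAX_ROWS 100

// each column stored separately, as in a columnar engine
int  ids[MAX_ROWS];
char names[MAX_ROWS][16];
int  ages[MAX_ROWS];
char departments[MAX_ROWS][16];
int  salaries[MAX_ROWS];
int  row_count = 0;

void insert_employee(int id, const char *name, int age,
                     const char *dept, int salary) {
    // one logical insert, five separate writes (one per column)
    ids[row_count] = id;
    snprintf(names[row_count], sizeof(names[row_count]), "%s", name);
    ages[row_count] = age;
    snprintf(departments[row_count], sizeof(departments[row_count]), "%s", dept);
    salaries[row_count] = salary;
    row_count++;
}

int main(void) {
    insert_employee(104, "David", 32, "HR", 55000);
    printf("rows stored: %d\n", row_count);
    return 0;
}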
Tiring, right? So let's go to the conclusion!
Row-oriented storage is efficient for operations involving entire records, such as fetching rows that match a condition, and it is also fast for inserting new rows. Column-oriented storage is efficient for operations involving subsets of columns, such as reading a single column's values across many rows; however, new row insertion is slow because the columns are stored separately (and may live on different parts of the disk).
If our workload involves frequent inserts, updates, and deletes (typical in OLTP systems), row-oriented databases like MySQL, PostgreSQL, and MongoDB are more suitable. These databases store entire rows together, making transactional operations efficient. On the other hand, if our workload is focused on analytics and reporting (OLAP), columnar databases like Apache Parquet, Apache ORC, ClickHouse, Amazon Redshift, and Google BigQuery are preferred.
You might think: don't OLAP systems require data insertion? Obviously they do! However, the nature of these inserts is quite different from OLTP. While OLAP databases still require inserts, they are optimised for bulk inserts and batch processing rather than frequent small transactions, i.e. OLAP use cases don't have frequent small transactions; instead they have occasional bulk transactions.
Now the question is, can this be improved further? I mean, whether the database is row-oriented or column-oriented, we are still reading from disk! Can we do better? The answer is YES!! For that, please check this blog.