Compression and Encoding
- sumitnagle
- Jun 24
- 3 min read
Compression techniques are essential in data storage and processing, as they reduce disk usage and enhance I/O performance. There nothing much tell apart from this, so we will directly hop on to the different techniques that we can do to reduce the data storage size.
Dictionary Encoding
This technique replaces repeating values with references to a dictionary, which stores unique values. By substituting repeated occurrences with shorter references, storage space is conserved. This method is particularly effective for columns with low cardinality (such as enums) where the number of unique values is significantly smaller than the total number of entries. For example, in Apache Parquet, dictionary encoding assigns an integer identifier to each unique value, and columns are stored as sequences of these identifiers.
["PAYMENT_UPI", "PAYMENT_UPI", "PAYMENT_CC", "PAYMENT_UPI", "PAYMENT_CC"] -> [0, 0, 1, 0, 1]
as this above dataset have low-cardinality with two distinct value, we can simply map the to integers, for example, PAYMENT_UPI -> 0 and PAYMENT_CC -> 1, and data can be encoded and stored as mentioned on the right.
Run-Length Encoding
RLE efficiently stores sequences of repeated values by recording the value and its run length. It's especially effective for data with long runs of identical values. For example, in image compression, a sequence like "AAAAAAABBBBCCCD" can be encoded as "7A4B3C1D", indicating the value and its frequency.
["AAABBC", "AABBBCCCC", "AABCCC"] -> ["A3B2C1", "A2B2C4", "A2B1C3"]
as this above dataset have long run of same character, using run-length encoding could be useful, and this data can be encoded and stored as mentioned on the right.
However, RLE is less effective for data without such patterns, as it might increase the data size. For example,
"ABCDEF" -> "A1B1C1D1E1F1" or "ABCDEF" (count 1 represented by no number)In both the cases, there is no advantage of using RLE.
Delta Encoding
Delta encoding stores the differences between consecutive values rather than the values themselves. This approach is beneficial for data with small or consistent variations, such as time-series data. For example, instead of storing the sequence "100, 105, 110, 115", delta encoding would store "100, 5, 5, 5", with the first value as a reference and subsequent values representing differences. This reduces the range of numbers, potentially leading to better compression.
[100, 105, 110, 115] -> [100, +5, +5, +5]
Delta encoding is particularly effective in time-series databases, where data points are sequential and differences between consecutive values are minimal. Time-series databases like InfluxDB and TimescaleDB use delta encoding for efficiently storing sequential time data.
Bit-Packing
Bit-packing minimises storage overhead by using the minimum number of bits necessary to represent data. For example, if a dataset contains integers ranging from 0 to 7, each value can be represented using only 3 bits instead of the standard 8 bits for a byte. This technique is particularly useful when dealing with large datasets of small integers, as it significantly reduces the storage requirement.
Conclusion
Just a head up! there are definitely many other compression techniques, but for this blog, we have shown only the famous ones! So, serve yourself and google for more.
Also, these compression methods can be combined to achieve higher efficiency. For instance, Apache Parquet employs a hybrid approach that integrates dictionary encoding, RLE and bit-packing. This combination allows for efficient storage of repeated values by choosing the most appropriate encoding strategy based on the data's characteristics.