Data Deduplication Tuning

This article describes the options available for fine-tuning data deduplication and compression.

Inline Deduplication

When data is received during a write, an attempt is made for each incoming block to locate a previously stored block with identical data. If an identical older block exists, the new block is never written to disk. This method of eliminating duplicate blocks on the fly is known as inline deduplication.
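The write path described above can be sketched in a few lines of Python. This is only a toy in-memory model, not the product's implementation: the BlockStore class, its attribute names and the 4096-byte block size are assumptions made for illustration, and the fingerprint uses SHA-256 because that is the hash function named later in this article.

```python
import hashlib


class BlockStore:
    """Toy in-memory model of inline deduplication (hypothetical, illustrative only)."""

    def __init__(self):
        self.by_fingerprint = {}  # fingerprint -> stored block
        self.blocks_written = 0   # number of blocks actually stored

    def write(self, block: bytes) -> str:
        """Store a block unless an identical one exists; return its fingerprint."""
        fingerprint = hashlib.sha256(block).hexdigest()
        if fingerprint not in self.by_fingerprint:
            # No identical older block: this one has to be stored.
            self.by_fingerprint[fingerprint] = block
            self.blocks_written += 1
        return fingerprint


store = BlockStore()
first = store.write(b"A" * 4096)    # assumed block size of 4096 bytes
second = store.write(b"A" * 4096)   # duplicate: nothing new is stored
third = store.write(b"B" * 4096)    # different data: stored
```

After the three writes only two blocks have been stored, and the first two writes resolve to the same fingerprint.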

Data deduplication is global: neither the interface over which a block was received (FC, iSCSI, etc.) nor the virtual disk that received it matters for deduplication. If a duplicate block is received, it is never stored, even if the older block belongs to another virtual disk.

Enabling/Disabling Deduplication and Compression for a VDisk

By default, when a virtual disk is added, data deduplication is enabled for it and data compression is turned off. Either setting can be changed at any time for each virtual disk. For example, data compression may be turned on while data deduplication is turned off. However, data received while deduplication is turned off cannot be deduplicated when deduplication is turned on later. The same applies to data compression.
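The effect of toggling deduplication can be illustrated with a small Python model. Everything here is hypothetical: the VDisk and Pool classes are invented for the example, and the assumption that no fingerprint is recorded for blocks written while deduplication is off is one plausible way to model the behaviour described above, not a statement about the actual implementation.

```python
import hashlib
from dataclasses import dataclass


@dataclass
class VDisk:
    """Hypothetical per-virtual-disk settings, using the defaults from the text."""
    name: str
    dedupe: bool = True        # deduplication is enabled by default
    compression: bool = False  # compression is turned off by default


class Pool:
    """Toy storage pool that counts physically stored blocks."""

    def __init__(self):
        self.fingerprints = set()  # fingerprints of blocks eligible for dedup
        self.stored_blocks = 0

    def write(self, vdisk: VDisk, block: bytes) -> None:
        if not vdisk.dedupe:
            # Assumption: the block is stored as-is and no fingerprint is
            # recorded, so it can never be matched by later writes.
            self.stored_blocks += 1
            return
        fingerprint = hashlib.sha256(block).digest()
        if fingerprint not in self.fingerprints:
            self.fingerprints.add(fingerprint)
            self.stored_blocks += 1


pool = Pool()
vdisk = VDisk("vdisk0", dedupe=False)
data = b"x" * 4096

pool.write(vdisk, data)  # deduplication off: stored
pool.write(vdisk, data)  # deduplication off: stored again
vdisk.dedupe = True
pool.write(vdisk, data)  # older copies are not candidates: stored, fingerprint recorded
pool.write(vdisk, data)  # duplicate of the previous write: eliminated
```

In this model four identical writes leave three stored copies: the two written while deduplication was off remain on disk, and only writes made after it was turned on are deduplicated against each other.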

Byte Verification

Duplicate blocks are identified by comparing the fingerprint of each new block with the fingerprints of previously stored data blocks. The fingerprints are maintained in deduplication tables, which reside on a disk identified as the master disk, and are computed using the SHA-256 hash function. There is, however, the possibility of a hash collision, in which two blocks containing non-identical data produce the same fingerprint when run through the SHA-256 hash function. This possibility is far too remote to be of practical concern, but if you need to eliminate it entirely, byte-by-byte verification can be enabled for a VDisk. With this option enabled, if two blocks are considered identical, the older block is read back from disk and the older and newer blocks are compared byte by byte to ensure that they really are identical.
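The difference the verify option makes can be shown with a minimal Python sketch. The VerifyingStore class is hypothetical, and two of its details are assumptions the article does not cover: the fingerprint function is pluggable so that a deliberately weak one can force a collision, and a block that fails verification is simply stored as new data.

```python
import hashlib


def sha256_fingerprint(block: bytes) -> bytes:
    return hashlib.sha256(block).digest()


class VerifyingStore:
    """Toy model of deduplication with optional byte-by-byte verification."""

    def __init__(self, verify: bool, fingerprint=sha256_fingerprint):
        self.verify = verify
        self.fingerprint = fingerprint
        self.table = {}      # fingerprint -> list of stored blocks
        self.read_backs = 0  # extra reads caused by verification

    def write(self, block: bytes) -> bool:
        """Return True if the block was deduplicated, False if it was stored."""
        candidates = self.table.setdefault(self.fingerprint(block), [])
        if candidates and not self.verify:
            return True                # trust the fingerprint match
        for older in candidates:
            self.read_backs += 1       # older block is read back from disk
            if older == block:         # byte-by-byte comparison
                return True
        candidates.append(block)       # new data, or a collision: store it
        return False


def weak(block: bytes) -> bytes:
    """Deliberately weak fingerprint (first byte only) to force a collision."""
    return block[:1]


trusting = VerifyingStore(verify=False, fingerprint=weak)
trusting.write(b"AB")
collided = trusting.write(b"AC")  # same fingerprint: wrongly treated as a duplicate

careful = VerifyingStore(verify=True, fingerprint=weak)
careful.write(b"AB")
caught = careful.write(b"AC")     # mismatch found on read-back: block is stored
```

Without verification the colliding block is silently dropped. With verification it is kept, at the cost of one extra read of the older block.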

Enabling the verify option slows down writes, because additional time is spent reading the older blocks back from disk. However, with the adoption of flash/SSD-based storage increasing rapidly, enabling this option for a pure flash/SSD-based storage