
Post by Stephen Foskett (thank you)
When I talk about the various state-of-the-art capacity optimization solutions that are now appearing in the market, the same comment usually arises: “Isn’t this the same as zip?” Or a long-tenured storage pro will remind me that, “Stacker died out a long time ago, so why is this any different?”
These are good points: The difference between traditional compression and modern data deduplication is somewhat hazy. And it doesn’t help that various implementations fall all along the spectrum from “mildly interesting” to “cutting edge!”
More than 150 years ago, Samuel Morse defined a coding scheme for text messages. He optimized the efficiency of transmission by using fewer symbols for common letters (“e” and “t” are a single dot or dash, respectively) and more for uncommon ones (“w” is a dot and two dashes). Morse Code remains in active use to this day, and the concept behind it laid the groundwork for the binary transmission technology used by computer systems today.
Pure data compression solutions apply Morse’s idea to a general set of data, with the core idea being that longer sequences of bits can be represented by shorter ones. Huffman Coding, a general mathematical mechanism to encode a string, allows a simple sequence of bits to represent a letter or part of an image. The compression engine replaces a longer sequence with a short tag that tells the de-compressor to substitute the correct, original sequence.
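To make the idea concrete, here is a minimal sketch of Huffman Coding in Python. It is an illustration only, not how any particular product does it; the function names (`huffman_codes`, `encode`, `decode`) are made up for this example, and the output is a string of "0" and "1" characters rather than packed bits.

```python
import heapq
from collections import Counter


def huffman_codes(text):
    """Return a dict mapping each symbol in text to a bit string."""
    freq = Counter(text)
    if not freq:
        return {}
    if len(freq) == 1:
        # Degenerate case: a single distinct symbol still needs one bit.
        return {next(iter(freq)): "0"}
    # Heap entries are (weight, tie_breaker, {symbol: code_so_far}).
    # The tie_breaker keeps Python from ever comparing the dicts.
    heap = [(w, i, {sym: ""}) for i, (sym, w) in enumerate(freq.items())]
    heapq.heapify(heap)
    counter = len(heap)
    while len(heap) > 1:
        # Merge the two lightest subtrees; prefix their codes with 0 and 1.
        w1, _, left = heapq.heappop(heap)
        w2, _, right = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in left.items()}
        merged.update({s: "1" + c for s, c in right.items()})
        heapq.heappush(heap, (w1 + w2, counter, merged))
        counter += 1
    return heap[0][2]


def encode(text, codes):
    return "".join(codes[ch] for ch in text)


def decode(bits, codes):
    # Works because Huffman codes are prefix-free: no code starts another.
    reverse = {code: sym for sym, code in codes.items()}
    out, current = [], ""
    for bit in bits:
        current += bit
        if current in reverse:
            out.append(reverse[current])
            current = ""
    return "".join(out)
```

As with Morse's scheme, the most frequent symbols end up with the shortest codes, and `decode(encode(text, codes), codes)` returns the original text.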
Other data compression techniques exist as well. DVD and MP3 use lossy compression engines that “throw away” data that is deemed unnecessary. The resulting output is not identical to the original content, but TV viewers or music listeners might not notice the difference.
Simple compression technology is easy to implement, but it typically finds redundancy only within relatively short spans of data. This makes pure compression useful but limited.
Single-instance storage (SIS) is another simple compression concept. Rather than letters or bit sequences, single-instancing looks for identical files or objects. If two people saved the exact same file, such a storage device would just save one copy of it, maintaining a pointer for consistency. Single-instance storage was a key component of Novell GroupWise in the 1990s, and was part of Microsoft Exchange until recently.
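A rough sketch of single-instancing, assuming files are identified by a SHA-256 hash of their contents. The `SingleInstanceStore` class and its method names are hypothetical, invented for this illustration; real products differ in how they detect and track duplicates.

```python
import hashlib


class SingleInstanceStore:
    """Toy store that keeps one copy of each distinct file."""

    def __init__(self):
        self.blobs = {}     # content hash -> file bytes (stored once)
        self.pointers = {}  # file name -> content hash

    def save(self, name, data):
        digest = hashlib.sha256(data).hexdigest()
        # Only the first copy of a given content is actually stored.
        self.blobs.setdefault(digest, data)
        self.pointers[name] = digest

    def load(self, name):
        return self.blobs[self.pointers[name]]

    def stored_bytes(self):
        return sum(len(blob) for blob in self.blobs.values())
```

Saving the same bytes under two names consumes space once. Change a single byte, though, and the whole file is stored again as a separate copy.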
Single-instancing is simple to implement but limited in effectiveness. Although duplicate files occur with reasonable frequency, they aren’t nearly as common as files that differ only slightly, including revisions of a similar document or presentation. This technology has had a minor resurgence as part of cloud storage services but is fairly uncommon in the enterprise today.
Data deduplication introduces a novel twist on single-instancing: Rather than looking for entire duplicate files, deduplication engines search for identical blocks within a file or data set. This makes “dedupe” much more effective in practice, since such systems can catch the shared blocks in slightly different files. But it is also much more difficult to implement such systems effectively, since there is significantly more data to process.
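Block-level deduplication can be sketched the same way, assuming fixed-size blocks and a SHA-256 hash per block. The `dedupe_blocks` and `rebuild` functions and the 4096-byte default are assumptions made for this example, not taken from any specific product.

```python
import hashlib


def dedupe_blocks(data, store, block_size=4096):
    """Split data into fixed-size blocks and keep each distinct block once.

    store is a shared dict (block hash -> block bytes) that is updated in
    place. Returns the "recipe": the ordered list of hashes for this data.
    """
    recipe = []
    for offset in range(0, len(data), block_size):
        block = data[offset:offset + block_size]
        digest = hashlib.sha256(block).digest()
        store.setdefault(digest, block)  # new blocks are stored, repeats are not
        recipe.append(digest)
    return recipe


def rebuild(store, recipe):
    """Reassemble the original data from its recipe."""
    return b"".join(store[digest] for digest in recipe)
```

Two revisions of a file that differ in one block share every other block in the store. One caveat with fixed-size blocks: inserting a few bytes near the start of a file shifts every later block boundary, so the blocks after the insertion no longer match.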
Since there is no perfect universal block size, some deduplication systems use variable-sized blocks. They evaluate a data set and determine what size to use based on best practices, similarities to other data sets, or trial and error. Some will also “unpack” bundled objects, looking for embedded media and the like.
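The trial-and-error approach might look something like the sketch below: try a handful of candidate block sizes and keep whichever gives the smallest estimated footprint. Everything here is an assumption for illustration, including the function names, the candidate sizes, and the `index_cost` of 40 bytes per block (a stand-in for the hash and pointer each block needs). Without some per-block overhead, the smallest block size would always win or tie.

```python
import hashlib


def stored_size(data, block_size):
    """Bytes needed to keep each distinct fixed-size block once."""
    unique = {}  # block hash -> block length
    for offset in range(0, len(data), block_size):
        block = data[offset:offset + block_size]
        unique.setdefault(hashlib.sha256(block).digest(), len(block))
    return sum(unique.values())


def pick_block_size(data, candidates=(512, 1024, 4096, 8192), index_cost=40):
    """Return the candidate block size with the smallest estimated footprint.

    index_cost is a hypothetical per-block metadata overhead in bytes.
    """
    def footprint(size):
        n_blocks = -(-len(data) // size)  # ceiling division
        return stored_size(data, size) + n_blocks * index_cost

    return min(candidates, key=footprint)
```

Highly repetitive data tends to favor the block size that lines up with its repeating unit, while data with no repeats favors the largest block, since smaller blocks only add index overhead.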

