Very good blog post from Stephen Foskett (thank you) on how Dropbox stores data (some de-duplication here)
Single Instance Storage
It’s fairly simple for a system to eliminate duplicate data by storing only a single instance of multiple identical files. In other words, if you and I both upload “Presentation.pptx” and it’s bit-for-bit identical, it would be a simple matter to store just one copy.
Dropbox definitely does this. I proved it with a simple experiment:
- Create a new 10 MB encrypted disk image in TrueCrypt (so it’ll be 100% unique, random data)
- Move it to the Dropbox folder and wait a few minutes as it uploads
- Copy the file with a new name to the folder and notice that it “uploads” instantly
So Dropbox is at least doing single-instance storage. This helps users, since it speeds up uploads and reduces bandwidth usage. It helps Dropbox in the same way, and then some: they still “charge” files against your account whether they’re single-instanced or not.
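To make the idea concrete, here is a minimal sketch of a single-instance store in Python. It is not Dropbox's actual implementation: the class name `SingleInstanceStore`, its methods, and the choice of SHA-256 as the fingerprint are all assumptions made for illustration. It mirrors the experiment above, where the second copy of a file "uploads" instantly but still counts against the account.

```python
import hashlib
import os


class SingleInstanceStore:
    """Toy global store: one copy of each distinct content, shared by all users."""

    def __init__(self):
        self.blobs = {}     # fingerprint -> file bytes, stored only once
        self.accounts = {}  # user -> {filename: fingerprint}

    def upload(self, user, name, data):
        """Return True if the bytes had to be transferred, False if deduplicated."""
        # Assumption: SHA-256 as the fingerprint; the real service may differ.
        fingerprint = hashlib.sha256(data).hexdigest()
        transferred = fingerprint not in self.blobs
        if transferred:
            self.blobs[fingerprint] = data
        self.accounts.setdefault(user, {})[name] = fingerprint
        return transferred

    def usage(self, user):
        """Quota is charged per file reference, even when the blob is shared."""
        files = self.accounts.get(user, {})
        return sum(len(self.blobs[fp]) for fp in files.values())


store = SingleInstanceStore()
image = os.urandom(1024)  # stand-in for the random TrueCrypt image
first = store.upload("alice", "image.tc", image)        # real transfer
second = store.upload("alice", "image-copy.tc", image)  # "instant" upload
```

Here `first` is `True` and `second` is `False`, only one blob is held in `store.blobs`, and `store.usage("alice")` still reports both files.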
Clashing MD5 Hashes?
A global single-instance storage system sounds great, but it opens the door to hash collision issues. Imagine if you and I both uploaded identical files. Both would have the same “fingerprint” and Dropbox would only store it once. Now imagine instead that, by coincidence or malice, I uploaded a file with the same fingerprint as yours but different contents. This is not as far-fetched as it seems, and could lead to all sorts of security nightmares. Read on here
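The collision scenario can be sketched in a few lines of Python. Everything here is hypothetical: `weak_fingerprint` deliberately keeps only the first byte of an MD5 digest so that a collision can be found by brute force in a fraction of a second, and `NaiveStore` and `VerifyingStore` are illustrative names, not anything Dropbox is known to use. The point is only to show what goes wrong when a store trusts the fingerprint alone, and one possible guard (comparing the actual bytes on a match).

```python
import hashlib


def weak_fingerprint(data: bytes) -> str:
    # Deliberately weak for the demo: 2 hex chars, so only 256 possible values.
    return hashlib.md5(data).hexdigest()[:2]


def find_collision(target: bytes) -> bytes:
    """Brute-force a different input with the same weak fingerprint."""
    wanted = weak_fingerprint(target)
    counter = 0
    while True:
        candidate = b"forged-%d" % counter
        if candidate != target and weak_fingerprint(candidate) == wanted:
            return candidate
        counter += 1


class NaiveStore:
    """Trusts the fingerprint alone: the first upload wins."""

    def __init__(self):
        self.blobs = {}

    def put(self, data: bytes) -> str:
        fp = weak_fingerprint(data)
        self.blobs.setdefault(fp, data)  # an existing blob is never replaced
        return fp

    def get(self, fp: str) -> bytes:
        return self.blobs[fp]


class VerifyingStore(NaiveStore):
    """Compares contents whenever a fingerprint already exists."""

    def put(self, data: bytes) -> str:
        fp = weak_fingerprint(data)
        existing = self.blobs.get(fp)
        if existing is not None and existing != data:
            raise ValueError("fingerprint collision: contents differ")
        self.blobs[fp] = data
        return fp


yours = b"your presentation"
mine = find_collision(yours)  # different bytes, same fingerprint

naive = NaiveStore()
naive.put(yours)
leaked = naive.get(naive.put(mine))  # my upload resolves to your file
```

With the naive store, `leaked` equals `yours` even though I uploaded `mine`. The verifying variant raises an error instead, at the cost of needing the full file contents on every matching upload, which gives back some of the bandwidth savings.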