As my colleague Terry Grulke pointed out earlier, there is a lot of funny math used by deduplication vendors to try to convince you that their systems can go fast. With our DXi systems we don’t have to hire Cirque du Soleil to generate our performance numbers. We can keep it simple because DXi systems are just really, really fast – natively.
That’s what I’m going to talk about here – “Native” performance. That is, the capability of the DXi system itself vs. some manufactured “logical” number like the ones Terry wrote about.
Apparently, our high performance is confusing to some of our competitors. Frequently when we are up against EMC or Data Domain, we get this question forwarded to us from the prospect, with this exact wording every time (Hmmm):
“The head unit only supports X disks and the expansion array supports Y, how does Quantum guarantee Z TB per hour (according to the documentation) with so few disks?” (Insert X, Y, and Z from the appropriate DXi datasheet.)
Well, we don’t “guarantee” performance – nobody does. But our datasheet numbers are repeatable and justifiable, and created with real data – the same data types you have: Exchange, MS-Office files, databases. No ‘special data’ that has been doctored to generate big numbers. The truth with deduplication is that YMMV, but our published numbers are achievable.
I understand why someone might wonder how we go so fast. What amuses me is that EMC hasn’t figured it out. Quantum has been helping our customers design high-performance disk storage systems for our StorNext file system for over 15 years. We’ve learned a thing or two about storage.
To do you (and EMC) a favor, I’ll briefly describe how the DXi storage architecture enables our high performance. There is a lot more to speed than spindles if you know what you’re doing.
First you have to understand the load, and then design the storage to fit. A deduplication system encounters two very different loads:
- Data: Incoming backups arrive in big, fat, high-speed, sequential streams.
- Metadata: Deduplication involves all sorts of chopping, hashing, and indexing activity, which generates tons of small random reads and writes – in many ways the opposite of the Data load.
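To make those two loads concrete, here is a toy sketch of a dedupe ingest loop – purely illustrative Python, not the DXi implementation, with a hypothetical fixed chunk size and an in-memory index standing in for the real fingerprint store. Notice the asymmetry: the data log only ever grows by big sequential appends, while the index takes a small random lookup (and often an insert) for every single chunk.

```python
import hashlib

CHUNK_SIZE = 64 * 1024  # hypothetical fixed-size chunks; real systems typically use variable-size chunking

def ingest(stream, index, data_log):
    """Deduplicate one backup stream against a fingerprint index."""
    new_bytes = 0
    for offset in range(0, len(stream), CHUNK_SIZE):
        chunk = stream[offset:offset + CHUNK_SIZE]
        fp = hashlib.sha256(chunk).hexdigest()  # metadata: hash every chunk
        if fp not in index:                     # metadata: small random lookup
            index[fp] = len(data_log)           # metadata: small random insert
            data_log.append(chunk)              # data: big sequential append
            new_bytes += len(chunk)
    return new_bytes
```

Run the same backup stream through twice and the second pass writes zero new data – every bit of work it does is index traffic, which is exactly the small-random-I/O load described above.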
It’s impossible to design a single pool of storage for peak performance on both loads, but it is trivial to design storage for either one in isolation. So that’s what we do in the DXi – we separate the loads. There are two pools of storage: one tuned for big sequential writes, the other tuned for small random I/O. Everything from the disk type, disk size, and RAID configuration on up is selected for maximum performance – lots of small, fast disks (or even SSDs) striped together for metadata, and big, slower disks combined for the data.
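A back-of-the-envelope illustration of why the two pools want different disks – the per-drive numbers below are hypothetical round figures, not DXi specs. A big 7.2K drive can stream sequentially about as well as a small 15K drive, but delivers a fraction of the random IOPS, so you spend the fast spindles where the random I/O lives:

```python
# Hypothetical per-drive figures for illustration only -- not DXi specs.
metadata_pool = {"disks": 12, "iops_per_disk": 400, "mb_s_per_disk": 120}  # small, fast spindles
data_pool     = {"disks": 12, "iops_per_disk": 120, "mb_s_per_disk": 180}  # big, slower spindles

def pool_iops(pool):
    # random metadata I/O scales with aggregate spindle IOPS
    return pool["disks"] * pool["iops_per_disk"]

def pool_mb_s(pool):
    # sequential data I/O scales with aggregate streaming bandwidth
    return pool["disks"] * pool["mb_s_per_disk"]
```

With these made-up numbers the metadata pool delivers over 3x the random IOPS (4,800 vs. 1,440), while the data pool actually streams faster (2,160 MB/s vs. 1,440). Each pool is good at exactly the load it receives – which is the whole point of separating them.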
Second, we use enterprise-grade hardware RAID controllers. They’re fast, reliable, and don’t rob system RAM and CPU cycles for storage activity. Can you believe anyone is actually still using dinosaur software RAID? You would be surprised, so make sure you ask that question when evaluating dedupe appliances.
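If you have shell access to a Linux-based appliance, software RAID is easy to spot for yourself: the md driver reports its arrays in /proc/mdstat. Here’s a minimal sketch (it assumes the standard mdstat line format, where array lines start with the device name):

```python
def software_raid_arrays(mdstat_text):
    """Return the names of Linux md (software RAID) arrays listed in /proc/mdstat."""
    arrays = []
    for line in mdstat_text.splitlines():
        # array lines look like: "md0 : active raid1 sdb1[1] sda1[0]"
        if line.startswith("md") and " : " in line:
            arrays.append(line.split(" : ", 1)[0])
    return arrays
```

On a live system you’d feed it the contents of /proc/mdstat. An empty result means no md arrays are configured – it says nothing about hardware RAID, which is handled below the OS and is invisible here by design.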
Third, it really helps to have a flexible, high-speed file system. We just so happen to have one. Like I said, it’s called StorNext. You can get OK performance out of standard Linux file systems, but we aren’t satisfied with ‘OK’. StorNext is highly tunable, flexible, and insanely fast. Since it’s ours, we have used it in the DXi from the start. There are DXi models for various price points, and the details of the storage hardware vary. StorNext lets us maximize the performance of every system, giving you the most bang for your buck no matter which one you buy.
There, that wasn’t so hard, was it? Of course there is more to system performance than storage. Software design plays a large part too, but without a properly designed storage layer the fastest code in the world won’t help you. I could tell you more about the software but I don’t want to give away all of our secrets…