Data Deduplication Frequently Asked Questions
What is Data Deduplication?
Data Deduplication is similar to data compression, but it looks for redundancy in very large sequences of bytes across very large comparison windows. Long (8 KB+) sequences are compared against the history of previously stored sequences and, where possible, a reference to the first stored copy of a sequence is kept instead of storing the sequence again. In a storage system, this is all hidden from users and applications, so a file reads back in full, exactly as it was written.
Why deduplicate data?
Eliminating redundant data can significantly shrink storage requirements and improve bandwidth efficiency. Because primary storage has gotten cheaper over time, enterprises typically store many versions of the same information so that new work can re-use old work. Some operations, such as backup, store extremely redundant information. Data Deduplication lowers storage costs since fewer disks are needed, and shortens backup/recovery times since there can be far less data to transfer. In the context of backup and other nearline data, it is reasonable to assume that there is a great deal of duplicate data. The same data is stored over and over again, consuming unnecessary storage space (disk or tape), electricity (to power and cool the disk or tape drives), and bandwidth (for replication), and creating a chain of cost and resource inefficiencies within the organization.
How does data deduplication work?
Data Deduplication segments the incoming data stream, uniquely identifies the data segments, and then compares the segments to previously stored data. If an incoming data segment is a duplicate of what has already been stored, the segment is not stored again, but a reference is created to it. If the segment is unique, it is stored on disk.
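The segment, identify, and compare steps described above can be sketched in a few lines of Python. This is a minimal illustration, not any vendor's implementation: the class name `DedupStore`, the fixed 8 KB segment size, and the use of SHA-256 as the segment identifier are all assumptions made for the example. Real products may use variable-size segments and on-disk indexes.

```python
import hashlib

SEGMENT_SIZE = 8 * 1024  # assumed fixed segment size, chosen for illustration


class DedupStore:
    """Toy in-memory deduplicating store (hypothetical, for illustration only)."""

    def __init__(self):
        self.segments = {}  # fingerprint -> segment bytes, each unique segment kept once
        self.files = {}     # file name -> ordered list of fingerprints (the references)

    def write(self, name, data):
        refs = []
        for start in range(0, len(data), SEGMENT_SIZE):
            segment = data[start:start + SEGMENT_SIZE]
            # Uniquely identify the segment by a hash of its content.
            fingerprint = hashlib.sha256(segment).hexdigest()
            # Store the segment only if it has not been stored before.
            if fingerprint not in self.segments:
                self.segments[fingerprint] = segment
            # Duplicate or not, the file records a reference to the segment.
            refs.append(fingerprint)
        self.files[name] = refs

    def read(self, name):
        # Reassemble the whole file from its referenced segments.
        return b"".join(self.segments[fp] for fp in self.files[name])

    def physical_bytes(self):
        return sum(len(s) for s in self.segments.values())


# Two "weekly backups" of four segments each; only the last segment differs.
week1 = b"".join(bytes([i]) * SEGMENT_SIZE for i in range(4))
week2 = week1[:3 * SEGMENT_SIZE] + bytes([9]) * SEGMENT_SIZE

store = DedupStore()
store.write("week1", week1)
store.write("week2", week2)
print(len(week1) + len(week2), "logical bytes,", store.physical_bytes(), "physical bytes")
```

In this toy run the two backups total 65,536 logical bytes, but only five unique segments (40,960 bytes) are kept, and both files still read back in full.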
For example, a file or volume that is backed up every week creates a significant amount of duplicate data. Data Deduplication algorithms analyze the data and can store only the compressed, unique changed elements of that file. With average backup retention policies on normal enterprise data, this process can reduce storage capacity requirements by an average of 10 to 30 times or more. This means that companies can store 10 TB to 30 TB of backup data on 1 TB of physical disk capacity, which has substantial economic benefits.
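To see how retention and change rate drive a reduction figure like this, here is a rough back-of-the-envelope model. It assumes weekly full backups, a fixed fraction of data that changes each week, and a flat compression factor applied to whatever is stored. The function name `dedup_ratio` and the sample numbers are hypothetical and are not measurements from any product.

```python
def dedup_ratio(weeks_retained, weekly_change_rate, compression=1.0):
    """Estimate logical-to-physical ratio for retained weekly full backups.

    Sizes are expressed in units of one full backup.
    """
    # Every retained week is a full logical copy.
    logical = weeks_retained
    # Physically: one full copy, plus only the changed fraction for each later week,
    # all reduced by the assumed compression factor.
    physical = (1 + (weeks_retained - 1) * weekly_change_rate) / compression
    return logical / physical


# Hypothetical inputs: 16 weeks retained, 2% weekly change, 2:1 compression.
print(round(dedup_ratio(16, 0.02, compression=2.0), 1))
```

With these assumed inputs the model gives a ratio of roughly 24.6, which falls inside the 10 to 30 times range quoted above. Shorter retention or a higher change rate lowers the ratio.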
Is data deduplication easy to implement?
This depends on the vendor. Data Domain has made it very easy by creating a fast, application-independent storage system (attachable as a file server over Ethernet or as a VTL over Fibre Channel). No client software or other configuration is required. As a result, Data Domain data deduplication should be invisible to backup, recovery, and other nearline storage processes. It should work easily with various data movers and workloads, including non-backup data such as e-mail archives, reference data, and engineering revision libraries. More flexibility means more consolidation is possible using less physical infrastructure.
Is deduplication of data safe?
It's very difficult to harden a storage system so that it has the resiliency that you need to remain operational through a drive failure or a power failure. Find out what technologies the data deduplication solution has to ensure data integrity and protection against system failures. The system should tolerate deletions, cleaning, rebuilding a drive, multiple drive failures, power failures - all without data loss or corruption. While this is always important in storage, it is an even bigger consideration in data protection with data deduplication. With data deduplication solutions, there may be 1,000 backup images that rely on one copy of source data. Therefore this source data needs to be kept accessible and with a high level of data integrity.
While deduplication storage raises the need for data integrity, it also offers new opportunities for data verification. In Data Domain's case, we take full advantage of the small resulting data size to do a complete internal test recovery, end to end, through the file system and to the disk platter, following each backup. There is less data to read than in a normal disk system, so this read-after-write operation is practical.
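The general idea of a read-after-write check can be sketched as follows. This is not Data Domain's actual verification procedure. It is a simplified illustration in which the function names, the in-memory `store` dictionary, and the use of SHA-256 are assumptions. After a backup is written as deduplicated segments, every referenced segment is read back, re-hashed, and the reassembled stream is compared with a digest of the original data.

```python
import hashlib


def write_segments(data, store, segment_size=8192):
    """Write data as deduplicated segments and return the list of references."""
    refs = []
    for i in range(0, len(data), segment_size):
        segment = data[i:i + segment_size]
        fingerprint = hashlib.sha256(segment).hexdigest()
        store.setdefault(fingerprint, segment)  # keep only the first copy
        refs.append(fingerprint)
    return refs


def verify_backup(refs, store, expected_digest):
    """Read every referenced segment back and check it end to end."""
    whole = hashlib.sha256()
    for fingerprint in refs:
        segment = store.get(fingerprint)
        # A missing or altered segment fails verification immediately.
        if segment is None or hashlib.sha256(segment).hexdigest() != fingerprint:
            return False
        whole.update(segment)
    # The reassembled stream must match the digest taken when the data arrived.
    return whole.hexdigest() == expected_digest


store = {}
backup = b"nightly backup payload " * 2000
refs = write_segments(backup, store)
print(verify_backup(refs, store, hashlib.sha256(backup).hexdigest()))
```

Because only unique segments are stored, a check of this kind reads far less data than re-reading every full backup image would.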
Single Instance Storage vs Data Deduplication
Data Deduplication is fundamentally different from Single Instance Storage, or SIS, which is a more limited form of deduplication that reduces duplicate copies of whole files. This file-level deduplication is intended to eliminate redundant (duplicate) files on a storage system by saving only a single instance of each file. If you change the title of a 2 MB Microsoft Word document, SIS would retain the first copy of the Word document and also store the entire modified document. Any change to a file requires the entire changed file to be stored, so frequently changed files would not benefit from SIS. Data deduplication, which reduces data at the sub-file level, would recognize that only the title had changed and, in effect, store only the new title, with pointers to the rest of the document's content segments.
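The difference can be made concrete with a small comparison. The sketch below is illustrative only: the function names are hypothetical, the 2 MB "document" is synthetic data, and fixed 8 KB segments are assumed. The title change is also kept the same length so later segments do not shift. An edit that changes the length would shift every later fixed-size segment, so this simple scheme would not handle it well.

```python
import hashlib


def file_level_bytes(versions):
    """Bytes stored when only whole-file duplicates are eliminated (SIS-style)."""
    seen = {}
    for data in versions:
        seen.setdefault(hashlib.sha256(data).digest(), len(data))
    return sum(seen.values())


def segment_level_bytes(versions, segment_size=8192):
    """Bytes stored when duplicate sub-file segments are eliminated."""
    seen = {}
    for data in versions:
        for i in range(0, len(data), segment_size):
            segment = data[i:i + segment_size]
            seen.setdefault(hashlib.sha256(segment).digest(), len(segment))
    return sum(seen.values())


# Synthetic 2 MiB document made of 256 distinct 8 KiB segments.
original = b"".join(i.to_bytes(4, "big") * 2048 for i in range(256))
# Same-length "title" change at the start of the document.
modified = b"NEWTITLE" + original[8:]

print(file_level_bytes([original, modified]))     # 4194304: both full copies kept
print(segment_level_bytes([original, modified]))  # 2105344: one copy plus one changed segment
```

File-level storage keeps both 2 MiB versions, while segment-level storage keeps the original plus the single 8 KiB segment that contains the new title.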
Inline vs. Post-Process Data Deduplication
Inline data deduplication means the data is deduplicated before it is written to disk. Post-process data deduplication analyzes and reduces data after it has been stored to disk.
Inline data deduplication is the most efficient and economical method of deduplication. It significantly reduces the raw disk capacity needed in the system, since the full, not-yet-deduplicated data set is never written to disk. If replication is supported as part of the inline data deduplication process, inline also shortens time-to-DR (disaster recovery) compared with post-process methods, because the system does not need to wait to absorb the entire data set and then deduplicate it before it can begin replicating to the remote site.