Data Deduplication De-Duped
Data de-duplication has become an increasingly popular buzzword among software and hardware manufacturers. As data centers experience continuous data growth, data de-duplication is one of several possible solutions that can provide relief. This article defines data de-duplication and describes the benefits attributed to the technology. It also explores the growing differences among manufacturers' definitions of data de-duplication and what data center managers need to know in order to make informed decisions.
Data de-duplication goes by many names, such as capacity optimization, factoring, single-instance storage, and intelligent compression. All of these terms refer to reducing storage by eliminating redundant data. Data found to be redundant is replaced by a pointer to the original unique copy, which prevents identical data from being recorded twice. The following scenario illustrates the technology: a CEO sends an email containing a 2 MB attachment to 1000 employees. Typically an email system will store 1000 copies of the 2 MB attachment; de-duplication technology, however, will store only 1 copy of the attachment and 999 pointers. As a result, 2000 MB of data can be held in roughly 2 MB of storage, plus the small overhead of the pointers.
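The email scenario above can be sketched in a few lines of Python. This is only an illustration of the pointer idea, not how any particular product works: the class name `SingleInstanceStore`, the helper `email_example`, and the use of SHA-256 to identify identical content are all assumptions made for the example, and the pointer overhead is ignored.

```python
import hashlib


class SingleInstanceStore:
    """Toy store: one copy per unique content, plus pointers to it."""

    def __init__(self):
        self.blobs = {}     # digest -> the single stored copy of that content
        self.pointers = {}  # logical name -> digest of the content it refers to

    def put(self, name, data):
        # Identical content always produces the same digest, so a repeat
        # write only records a new pointer instead of a second copy.
        digest = hashlib.sha256(data).hexdigest()
        self.blobs.setdefault(digest, data)
        self.pointers[name] = digest

    def get(self, name):
        # Follow the pointer back to the single stored copy.
        return self.blobs[self.pointers[name]]

    def logical_bytes(self):
        # What the data would occupy if every copy were stored in full.
        return sum(len(self.blobs[d]) for d in self.pointers.values())

    def physical_bytes(self):
        # What is actually stored (pointer overhead ignored).
        return sum(len(b) for b in self.blobs.values())


def email_example(copies=1000, attachment_mb=2):
    """Store the same attachment once per recipient; return (logical MB, physical MB)."""
    mb = 1024 * 1024
    attachment = b"x" * (attachment_mb * mb)
    store = SingleInstanceStore()
    for i in range(copies):
        store.put(f"mailbox-{i}/attachment", attachment)
    return store.logical_bytes() // mb, store.physical_bytes() // mb
```

Calling `email_example()` with the defaults mirrors the scenario: 1000 logical copies of a 2 MB attachment add up to 2000 MB, while only the single 2 MB copy is physically stored.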
Each de-duplication technology implements the process differently. Data de-duplication can operate at the file level or at the block or bit level, and it can run inline or as a post-process.
File-level de-duplication, often associated with content-addressable storage (CAS), is most commonly called single-instance storage and corresponds to the email example above. If multiple copies of an identical file are found, only one copy is stored, along with the corresponding pointers. Typically, file-level de-duplication yields up to 4:1 data reduction.
Block-level or bit-level de-duplication looks deeper into the files and finds redundant patterns in small chunks of data. Files are broken down into their blocks and bits, and those blocks and bits are compared to find redundant patterns. For example, a company uses the same letterhead in several Word documents throughout its systems. The bit pattern representing the letterhead in each Word document is detected and recognized as redundant data. As the Word documents are broken down, pointers are recorded to represent the redundant data in each. Block-level de-duplication can often yield as much as 20:1 data reduction.
Inline de-duplication seeks out redundancies in real time as data streams into the device. Post-process de-duplication occurs after the files have been written in their entirety to disk; only then does the de-duplication process run to eliminate redundancies.
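The block-level approach described above can be sketched as follows. This is a minimal illustration under stated assumptions, not a vendor implementation: the function names, the fixed 4096-byte block size, and the use of SHA-256 digests as block identifiers are all chosen for the example. Fixed-size blocks only detect a repeat when it falls on the same block boundaries, which is why the letterhead here is padded to exactly one block; commercial products often use more elaborate chunking.

```python
import hashlib

BLOCK_SIZE = 4096  # assumed block size, chosen only for illustration


def dedupe_blocks(files, block_size=BLOCK_SIZE):
    """Split each file into fixed-size blocks and keep one copy of each unique block.

    Returns (store, recipes): store maps digest -> block bytes, and recipes
    maps file name -> ordered list of digests (the "pointers").
    """
    store = {}
    recipes = {}
    for name, data in files.items():
        recipe = []
        for offset in range(0, len(data), block_size):
            block = data[offset:offset + block_size]
            digest = hashlib.sha256(block).hexdigest()
            store.setdefault(digest, block)  # store only the first occurrence
            recipe.append(digest)            # always record a pointer
        recipes[name] = recipe
    return store, recipes


def restore(name, store, recipes):
    """Rebuild a file by following its pointers in order."""
    return b"".join(store[d] for d in recipes[name])


def reduction_ratio(files, store):
    """Logical bytes divided by physically stored bytes."""
    logical = sum(len(data) for data in files.values())
    physical = sum(len(block) for block in store.values())
    return logical / physical


def letterhead_example(documents=5):
    """Build documents that share one letterhead block and differ in the body."""
    letterhead = b"ACME Corp letterhead".ljust(BLOCK_SIZE, b".")
    return {
        f"letter-{i}.doc": letterhead
        + f"Letter number {i}".encode().ljust(BLOCK_SIZE, b" ")
        for i in range(documents)
    }
```

With the five documents from `letterhead_example()`, `dedupe_blocks` stores 6 unique blocks instead of 10 (one shared letterhead block plus five bodies), a reduction of about 1.67:1. The much higher ratios quoted above depend on how repetitive the real data is.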
The benefits generally sought from data de-duplication are significant reductions in disk storage, longer retention for disk-based backups, and easier disaster recovery.
Reducing disk storage can generate significant savings. De-duplication reduces the amount of disk required, which in turn reduces the energy needed to power and cool the array and the data center as a whole. Data de-duplication therefore plays a significant role in the “green initiative” for data centers.
Longer retention for disk-based backups allows organizations to move away from tape. Tape-based backups can be cumbersome, time-consuming, costly, and, most importantly, unreliable. Restoring 30- to 60-day-old data from disk with a few clicks is significantly faster and more efficient than the traditional methods associated with tape. If tapes have been sent off-site, they must be recalled, reloaded, and then read to see whether the data can be restored. Tapes are often misplaced or damaged in storage, leaving the data unrecoverable.
Replication is the key facilitator of disaster recovery. For years, bandwidth was the limiting factor for replication: costs were extremely high and availability was uncertain. Bandwidth costs are still high today, but because the data is de-duplicated before it is sent, continuous replication is possible over significantly smaller pipes. Disaster recovery strategies are now simpler and more cost-effective for organizations to implement.
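To see why de-duplicated data needs a smaller pipe, consider a sketch in which the primary site sends a block to the disaster recovery site only when that site does not already hold it. The function `replicate`, the dictionary standing in for the remote site, and the 4096-byte block size are assumptions for illustration; a real replication protocol would also exchange digests over the network, which this sketch does not count.

```python
import hashlib


def replicate(data, remote_blocks, block_size=4096):
    """Send only the blocks the remote site lacks.

    remote_blocks is a dict (digest -> block) standing in for the remote
    site's block store; it is updated in place. Returns (recipe, bytes_sent),
    where recipe is the ordered list of digests needed to rebuild the data.
    """
    recipe = []
    bytes_sent = 0
    for offset in range(0, len(data), block_size):
        block = data[offset:offset + block_size]
        digest = hashlib.sha256(block).hexdigest()
        if digest not in remote_blocks:
            remote_blocks[digest] = block  # this block crosses the wire
            bytes_sent += len(block)
        recipe.append(digest)              # pointers are kept either way
    return recipe, bytes_sent


def rebuild(recipe, remote_blocks):
    """Reassemble the data at the remote site from its block store."""
    return b"".join(remote_blocks[d] for d in recipe)
```

In this model, the first backup of a data set sends every unique block, while a later backup in which a single block has changed sends only that one block.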
In many cases, these benefits can be realized within your data center using the data de-duplication technologies described here. Vendors including Data Domain, EMC, CommVault, FalconStor, and Quantum already offer data de-duplication solutions. Which technology will benefit your data center most depends on your environment, but de-duplication may be one of the technologies worth looking into.
The author, Nicholas Cerrone, is a storage consultant with Zibiz Data Management.