Demystifying Data Deduplication: The 3 Types of Dedupe



By Narayanan Prasath for Beyond The Blocks - Wednesday, September 21, 2016


‘Deduplication’ is a word that vendors throw around left and right. With the rise of flash, dedupe has become more relevant than ever for making the most of the precious storage space on this expensive, super-fast media. However, a lot of buzzwords are used to describe deduplication technologies, which makes it difficult to compare different offerings.

By the end of this post you will know everything you need to make an informed decision about dedupe technologies.

Data deduplication has been around since the 1970s, when companies had to reduce the physical storage of customer information. If a customer had the same address for billing and shipping, they would keep just one copy of it and discard the other. That was deduplication in its primitive form.

With the widespread use of computers and the Internet, data management became crucial for businesses trying to reduce storage costs. That’s when data deduplication came into the picture and became a key component of storage solutions. It’s been more than two decades since it entered the world of computing, and until recently it hadn’t changed all that much.

Dedupe is a data management technique that eliminates redundancy, making more efficient use of space and improving effective capacity.

In the data deduplication process, data blocks are analyzed to identify duplicates, and the system stores only one copy while discarding the rest. It therefore doesn’t need space for every copy of the data, which reduces capacity needs and uses storage more efficiently.
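To make the idea concrete, here is a minimal Python sketch of block-level dedupe. The block size, names and data structures are purely illustrative - this is the concept, not any particular product’s implementation:

```python
import hashlib

BLOCK_SIZE = 4096  # hypothetical fixed block size

def dedupe(data: bytes):
    store = {}   # fingerprint -> the single physical copy of a block
    layout = []  # logical order of blocks, recorded as fingerprints
    for i in range(0, len(data), BLOCK_SIZE):
        block = data[i:i + BLOCK_SIZE]
        fp = hashlib.sha256(block).hexdigest()
        if fp not in store:
            store[fp] = block  # first (unique) copy is stored
        layout.append(fp)      # duplicates cost only a reference
    return store, layout

store, layout = dedupe(b"A" * 8192 + b"B" * 4096)
print(len(store), len(layout))  # 2 unique blocks backing 3 logical blocks
```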

While this is just a high-level intro, things look different when we go deeper.

The point at which (storage) deduplication occurs has a tremendous influence on its efficiency and effectiveness. In the storage pipeline, data travels from host to cache to disks. Dedupe was initially performed at the media level (disk, tape). As the technology evolved, the dedupe process moved higher up, to the cache level (RAM). In simple terms, it is ‘post-process’ if dedupe takes place on the media itself and ‘in-line’ if it takes place in the cache.

Capacity, speed and durability - the most important requirements for any business - are directly influenced by the point at which dedupe occurs. Currently, in-line dedupe is adopted by most storage solution companies because of its clear advantages over post-process, which we will cover in this post.

But is in-line dedupe good enough?

With tons of data pouring in from every direction - IoT, mobile, video, etc. - demand for storage capacity is spiking at exponential rates. It’s almost impossible to rely on hardware technology alone to handle capacity and speed demands.

The critical question becomes: why should you allow the system to store (write) duplicate data when you know it is duplicate?

That’s where In-line In-memory comes in, performing global deduplication. But before we dive into In-line In-memory, let’s take a look at post-process and in-line deduplication - the two most common methods, offered by most storage solution companies on the market - to understand their drawbacks and why you should think twice before investing in them.

POST-PROCESS


The earliest dedupe technique:

Post-process was the first deduplication method in the storage solutions market. Here, dedupe happens at the disk level, which means the incoming data has to be stored first (taking up capacity) before any dedupe takes place. All of the incoming, un-deduplicated data is written to the cache and then moved to disk or SSD. Deduplication happens only after this move.

Depending on the product, dedupe may occur at the SSD level, the disk level, or both. All blocks have to be scanned and compared with each other to find duplicates. Since this process is very slow, it is often scheduled to run only at night. In most products it is also impractical to dedupe at the system level - blocks are only compared within the scope of a single volume or RAID set.

Since all incoming data has to be stored somewhere before deduplication can take place, capacity has to be large enough to hold it un-deduplicated.
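A rough sketch of that flow, with hypothetical names: writes land on the media verbatim, and a scheduled job scans for duplicates later.

```python
import hashlib

disk = {}  # block address -> raw block, written un-deduplicated

def write(addr: int, block: bytes) -> None:
    disk[addr] = block  # ingest path: no dedupe, full capacity consumed

def nightly_dedupe() -> dict:
    seen = {}      # fingerprint -> first address holding this block
    remapped = {}  # duplicate address -> canonical address
    for addr, block in list(disk.items()):
        fp = hashlib.sha256(block).hexdigest()
        if fp in seen:
            remapped[addr] = seen[fp]  # point the duplicate at the first copy
            del disk[addr]             # reclaim the duplicate's space
        else:
            seen[fp] = addr
    return remapped
```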

Post-process dedupe was designed at a time when CPU resources were quite expensive and the main media was disk drives. For this reason, delaying deduplication until after the data is ingested frees up a potential bottleneck and allows for faster write throughput.

However, newer in-line implementations using multi-core processors and faster media have proven that dedupe can occur much closer to the arrival of data into the system without impacting performance.

TRADITIONAL IN-LINE


In in-line deduplication, all of the incoming data is written to the cache, but unlike post-process, not all of it is moved to the media level. Deduplication happens within the cache, and only deduped data is written to the media. Since the output of dedupe is highly random, this technology was implemented only in SSD-based systems. The number of reads/writes between the cache and flash is significantly reduced. With in-line deduplication in place, precious SSD capacity becomes more affordable because you are storing only deduped data in flash.

But what about processing time?

In-line requires a lot of processing power, and when high volumes of data enter the system, latency in write operations can cause bottlenecks and decrease server performance. Compared with post-process, though, in-line has several advantages: increased effective capacity and fewer IOs inside the system, saving plenty of time and resources. Still, in-line has to write the data - duplicates included - into memory before deduplication can take place.
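A simplified sketch of the traditional in-line path, with illustrative structures: the block still lands in memory first, the fingerprint check happens in cache, and only unique blocks ever cross over to flash.

```python
import hashlib

cache = {}  # fingerprint -> block staged in RAM
flash = {}  # fingerprint -> block on SSD; holds deduped data only

def inline_write(block: bytes) -> str:
    # The block arrives in memory first; the fingerprint check
    # happens in cache, before anything reaches the media.
    fp = hashlib.sha256(block).hexdigest()
    if fp not in cache and fp not in flash:
        cache[fp] = block  # unique block: stage it for the next flush
    return fp              # duplicate: only a reference, no second copy

def flush() -> None:
    flash.update(cache)  # only unique blocks cross the cache/flash boundary
    cache.clear()
```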

Almost every major storage vendor today incorporates in-line deduplication in their all-flash systems for its clear advantages over post-process dedupe. But customers need to understand more precisely what type of deduplication is on offer, and investigate where exactly in the storage pipeline it takes place, before making a purchase.

Because not all “in-line” is truly in-line.

While several vendors claim to perform in-line dedupe, they actually perform post-process dedupe: unable to handle overloads, their controllers push raw data to the flash disks and perform dedupe at the disk level. In such cases, performance takes a hit and effective capacity is far lower. It is in-line only when dedupe occurs at the RAM level itself.

Writing duplicate data to media during overloads is a major drawback of this kind of “in-line”, because it then acts like post-process. It dedupes in-line only partially in order to keep performance up, compromising capacity.
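A few hypothetical lines capture that fallback: above an invented queue-depth threshold, the controller skips the fingerprint check and writes raw blocks, deferring dedupe to a later pass.

```python
import hashlib

MAX_INFLIGHT = 64  # invented queue-depth threshold for the fallback
seen = set()       # fingerprints of blocks already stored
flash = []         # media; may accumulate duplicates under load

def write(block: bytes, inflight: int) -> str:
    if inflight > MAX_INFLIGHT:
        flash.append(block)  # overload: raw write, duplicates and all
        return "deferred to post-process"
    fp = hashlib.sha256(block).digest()
    if fp not in seen:
        seen.add(fp)
        flash.append(block)  # unique block written once
    return "deduped in-line"
```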

Performance has always been a trade-off against capacity, but things are different with the type of dedupe we are going to see next.

IN-LINE IN-MEMORY


In-line In-memory dedupe eliminates the need for a separate deduplication pass, speeding up the storage process and making its predecessors look like dinosaurs. It prevents duplicate data from existing anywhere, anytime, on any tier by identifying it right as it falls off the wire, before anything is written to the system.

Really?

It’s possible to pull this off by hashing the data (calculating a mathematical fingerprint for each data block) as soon as it arrives at the system as a write request. The dedupe engine then looks for a match of the fingerprint across the entire system, comparing against blocks in any of the tiers. If there is a match, it ignores the write request and references the already-written data block. If there is no match - meaning the data block is unique - the data is written into the cache.

All this happens before the data is written anywhere.

Therefore, the cache, RAM, SSD and disks hold only unique data blocks at all times, drastically increasing effective capacity and storage performance. The non-duplicated data in the cache is also compressed before it leaves for the SSD, effectively doubling the capacity gains beyond the deduplication ratio.
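Conceptually, the write path might look like the following sketch - one global index spanning every tier, consulted before anything is written, with unique blocks compressed on their way to SSD. All names and structures are illustrative, not any vendor’s actual implementation:

```python
import hashlib
import zlib

global_index = {}  # fingerprint -> (tier, ref_count); spans every tier
ssd = {}           # fingerprint -> compressed unique block

def ingest(block: bytes) -> str:
    fp = hashlib.sha256(block).hexdigest()  # fingerprint off the wire
    if fp in global_index:
        tier, refs = global_index[fp]
        global_index[fp] = (tier, refs + 1)  # duplicate: bump a reference;
        return fp                            # nothing is written anywhere
    global_index[fp] = ("ssd", 1)
    ssd[fp] = zlib.compress(block)  # unique data is compressed before SSD
    return fp
```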

Simple concept, but huge benefits.

This simple yet powerful technology has huge implications when it comes to optimizing IOPS. Because no duplicate data is written to the system from the moment it comes off the wire, the number of reads/writes between tiers is far smaller - only deduped data is in motion, from cache to SSDs to spinning disks - which also means more durability (less wear and tear).

By identifying duplicate data as the IO comes in, instead of at the cache, RAM or SSD levels, In-line In-memory dedupe resolves the performance issues of traditional in-line dedupe, making it a better solution.

Another important feature that sets it apart is its dedupe-aware cache. With other approaches, when hosts read the same (deduped) data, the cache loads multiple copies of it; with In-line In-memory, the cache loads only one copy, leaving room and resources for other operations without affecting performance. The clear benefit shows up in VM boots: several hundred VMs can be started concurrently right from RAM, which makes it super fast.
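A sketch of that read path, assuming a cache keyed by fingerprint rather than logical address (names are illustrative): however many logical addresses map to the same block, it is loaded into RAM only once.

```python
block_map = {}   # logical address -> fingerprint (from the dedupe engine)
read_cache = {}  # fingerprint -> block; one copy, however many readers
media = {}       # fingerprint -> unique block on SSD/disk

def read(addr: int) -> bytes:
    fp = block_map[addr]
    if fp not in read_cache:
        read_cache[fp] = media[fp]  # first reader pays the media I/O
    return read_cache[fp]           # every later read of the same block,
                                    # from any logical address, hits RAM
```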

THE BOTTOM LINE

It can be quite tricky to find the storage solution that suits your infrastructure and, more importantly, satisfies your business needs. Existing infrastructure, data type and business requirements are the key considerations when deciding on the right storage solution for your organization.

Again, storage vendors provide a variety of solutions to optimize capacity and performance. The key takeaway from this post is that the point at which dedupe takes place is crucial. To put it simply: the earlier in the process dedupe takes place, the higher the effective capacity; the later it takes place, the higher the performance. Performance and capacity have always been inversely proportional - until now, with technologies like In-line In-memory coming to the surface.

In-line In-memory is an emerging solution that has just started gaining attention among early adopters because of its ability to improve effective capacity and reduce costs while not penalizing performance, but actually enhancing it.

Not only does In-line In-memory dedupe take place much earlier in the storage process than any other data deduplication technique, it also achieves superior performance, capacity and durability by combining many nifty micro-level tactics at various points in the system to optimize operations, making the whole greater than the sum of its parts.

That makes it capable of handling not only today’s storage requirements, but also the demands yet to come.

The applications of In-line In-memory dedupe are continuously evolving, and with its exceptional performance and capacity savings, it can easily resolve your storage challenges.



Written by Narayanan Prasath

Narayanan holds a mixed and unusual background. He served as an IT infrastructure engineer at a major financial company, yet he is passionate about visual effects and has worked with several independent filmmakers and start-ups, helping them tell their stories. At Reduxio, Narayanan works to inform IT professionals about the latest trends and news in the storage industry, and how to navigate the complexity of the market to make better and smarter decisions, in a visually compelling manner. He is also a digital artist and worships human-centered design.


