Data corruption and loss of access to data are part of every storage manager’s life, and probably always will be. A lot has been, and is being, done to minimize the loss of use of data, with millions of dollars invested in R&D to reduce these occurrences. But data corruption due to human error, software bugs, and logical errors in applications, as well as loss of use due to cyber attacks, will always be with us.
Once the worst-case scenario arises and the preventive systems in place fail, it’s time for the safety net to jump into the scene and save the day. Basic engineering principles dictate that a system’s fallback mechanism needs to be the safest part of all. Snapshot technology delivers just that. However, given that snapshots were designed and introduced more than 20 years ago, they suffer from many limitations in functionality and have become a burden to manage.
In this post, we’ll go over:
- Snapshot technology overview.
- A bit of the history of snapshots.
- Innovation in storage snapshots.
- Use cases.
- Limitations of the usage of snapshots in today’s IT.
- The future.
So… What are snapshots?
Snapshots are a storage-system technology for capturing and preserving the state of volumes belonging to applications, databases, or virtual machines at certain points in time - often referred to as “point-in-time snapshots”.
A snapshot is typically captured in one of the following methods:
- Random snapshots: created without any interaction or integration with the applications or file systems, and taken at “random” moments in time. These snapshots are usually not considered “safe” by many customers, even though, in reality, modern applications and file systems are crash-consistent - meaning they can recover from sudden power cycles and, hence, from “random” snapshots as well.
- File system crash-consistent snapshots: these freeze all IO to the file system while the snapshot is captured.
- Application-aware or application-consistent snapshots: these work by switching the application (for example, a database system) into what is called hot backup mode - forcing the database to perform extra logging so that it can start up cleanly after a recovery from the snapshot.
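The freeze/capture/thaw sequence behind the last two methods can be sketched in a few lines. This is a toy model, not a real integration: the three hooks below are hypothetical placeholders for a database’s hot-backup commands and a storage array’s snapshot API, and an event log stands in for the actual side effects.

```python
# Toy model of application-aware snapshot orchestration.
# begin_hot_backup / capture_snapshot / end_hot_backup are hypothetical
# placeholders; an event log records the ordering that matters.

events = []

def begin_hot_backup():
    events.append("begin_hot_backup")   # e.g. database starts extra logging

def capture_snapshot():
    events.append("capture_snapshot")   # point-in-time image on the array

def end_hot_backup():
    events.append("end_hot_backup")     # database resumes normal logging

def application_aware_snapshot():
    begin_hot_backup()
    try:
        capture_snapshot()
    finally:
        end_hot_backup()  # always leave hot backup mode, even on failure

application_aware_snapshot()
print(events)  # ['begin_hot_backup', 'capture_snapshot', 'end_hot_backup']
```

The `try/finally` is the important part: if the snapshot fails, the database must still be taken out of hot backup mode, or it keeps paying the extra-logging cost indefinitely.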
Snapshots were a breakthrough… 20 years ago.
Snapshots were introduced in the 1990s as a way to shorten backup and restore windows. The adoption of snapshot technology marked a before-and-after moment for storage backups.
The era before snapshots:
Before snapshots were introduced, customers restored from tape whenever data or applications needed to be recovered after a failure. In the case of databases, for example, you had to put the database in hot backup mode and then run a backup to tape: the backup window is the time it takes to read the entire dataset and write it to tape cartridges, which typically runs from hours to days for a large dataset. As you would expect, the database suffered a performance degradation during this time. To recover data, you had to read the entire set of tapes and write the data back to the storage system.
The era of snapshots (until now):
As we mentioned before, snapshots essentially eliminated both the time it takes to preserve a point in time of a volume and the time to recover to that point in time after a failure. When a snapshot is captured for a volume, it becomes immediately available as a point to which an application or a data set can be restored.
Primary Innovation in 20+ years of Snapshots
Since the introduction of snapshots, few improvements have been made to adapt this technology to increasingly demanding IT ecosystems. The only major innovation has been in how the point in time is captured - that is, the mechanism used to write data after a snapshot is taken. There are two basic ways to write that data.
Initial snapshot implementations used Copy-on-Write (CoW). For every new write, the system must first read the data you are about to change, copy it somewhere else, and only then write the update in place. Every write thus turns into a read plus two writes, which limits performance whenever a snapshot exists.
Newer implementations eliminate some of the performance issues by using what is called Redirect-on-Write (RoW). In RoW snapshots, data that is changed can stay as-is, while updates are written to new locations on disk - eliminating the extra IOs after a snapshot is captured.
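The IO difference between the two approaches can be made concrete with a toy model. The classes below are illustrative simplifications (dictionaries standing in for disk blocks), not any product’s implementation; the `io_count` tally is what matters.

```python
# Toy contrast of Copy-on-Write vs Redirect-on-Write snapshot writes.
# Dictionaries stand in for disk blocks; io_count tallies physical IOs.

class CowVolume:
    def __init__(self, data):
        self.data = dict(data)       # live blocks: offset -> value
        self.snapshot = {}           # pre-snapshot blocks preserved on write
        self.io_count = 0

    def write(self, offset, value):
        if offset in self.data and offset not in self.snapshot:
            old = self.data[offset]          # 1. read the existing block
            self.io_count += 1
            self.snapshot[offset] = old      # 2. copy it aside
            self.io_count += 1
        self.data[offset] = value            # 3. write the update in place
        self.io_count += 1

class RowVolume:
    def __init__(self, data):
        # offset -> list of versions; old data never moves
        self.blocks = {off: [val] for off, val in data.items()}
        self.io_count = 0

    def write(self, offset, value):
        # The update is redirected to a new location; one physical write.
        self.blocks.setdefault(offset, []).append(value)
        self.io_count += 1

cow = CowVolume({0: "a"}); cow.write(0, "b")
row = RowVolume({0: "a"}); row.write(0, "b")
print(cow.io_count, row.io_count)  # 3 1
```

A first overwrite under CoW costs three IOs where RoW costs one, which is exactly the post-snapshot overhead the paragraph describes.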
What do people use snapshots for?
There are two main use cases for storage snapshots:
- Data protection: the most straightforward use of snapshots - an online logical backup and recovery mechanism, scheduled every few hours, for running databases, applications, and virtual machines:
Protecting databases is considered a major use case for snapshots. Being able to quickly recover from failures is critical for any database workload - online transaction processing (OLTP), decision support (DSS), data warehousing, and so on.
It is common to use storage system snapshots as means to back up any of the common databases: Oracle DB, Microsoft SQL Server and MySQL, to name a few.
- Virtual Infrastructure:
Virtualization environments, which are now the core of IT infrastructure, are regularly protected by snapshots. The virtual machines are backed up every few hours by capturing datastore-level or VM-level snapshots.
It is also common to protect other applications, whether file-based or built on their own custom databases. Snapshots serve as the first line of backup for these applications.
- Test/dev: snapshots are also used for cloning database and application environments for developers and testers. For example, suppose an insurance company runs a production database-backed application. Each night, this database is cloned and used to “refresh” the developers’ environments - so developers can always test their code against a database that resembles the production one, without actually putting it at risk.
What are the limitations of snapshots?
For a technology that’s been the safety net for storage systems for the last 20 years, snapshots have been able to hold their ground. However, increasing demands make this solution seem almost archaic. These are the limitations of snapshots:
- For data protection: data is not protected most of the time - it’s actually at risk between snapshots. For example, imagine a call center that uses a central database to document all incoming calls and customer information. This call center’s data is protected using snapshots every 4 hours. However, if the last snapshot was captured at 10am and a problem occurs at 1:30pm, 3.5 hours of customer updates are lost forever.
- For test/dev: in all known implementations designed for cloning, when you clone a volume, it’s impossible to delete the parent volume. This makes it very complex to build more advanced clone hierarchies involving multiple branches from the parent.
- For storage in general: snapshots use internal mapping tables to keep track of the location of blocks - for each block there’s an entry that indicates which snapshot it belongs to. This implies a practical limit on the total number of snapshots supported, because these maps cannot grow so large that they consume all the storage capacity. As a result, snapshot implementations typically support only tens to a few thousand recovery points.
- Consistency groups: protecting a complex application whose data is stored on multiple volumes requires that snapshots be taken across all volumes consistently. This has traditionally been done using consistency groups, which impose an additional level of management overhead.
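A quick back-of-the-envelope calculation shows why the mapping tables cap the number of recovery points. All figures here (block size, bytes per map entry, full-overwrite worst case) are illustrative assumptions, not numbers from any particular storage system.

```python
# Rough worst-case estimate of snapshot mapping-table overhead.
# Block size and entry size are illustrative assumptions.

def metadata_bytes(volume_bytes, snapshots, block_size=16 * 1024, entry_bytes=32):
    blocks = volume_bytes // block_size
    # Worst case: every block is rewritten between snapshots, so each
    # snapshot needs one map entry per block.
    return blocks * entry_bytes * snapshots

TB = 1024 ** 4
overhead = metadata_bytes(10 * TB, 1000)  # 10 TB volume, 1,000 snapshots
print(overhead / TB)  # 19.53125 -> metadata alone can rival the volume size
```

Even under these modest assumptions, keeping a thousand recovery points on a 10 TB volume can demand on the order of the volume’s own capacity just for mapping metadata - which is why real systems cap snapshot counts well below that.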
The future of storage backups: Backdating™
For each write IO, storage systems have historically stored only the location it belongs to (the corresponding disk and the offset within that disk). Reduxio TimeOS™ added a new dimension: time. For each write IO, the system also maintains a timestamp. All of these writes are stored in a unique, highly efficient metadata structure that represents the physical deduped and compressed blocks, plus the logical references to those blocks - both in terms of offset and, more importantly, time of IO.
When a read comes into the system, TimeOS™ queries its metadata for the logical blocks that answer that read request. The request may ask for the latest version of a logical block, but it may just as well ask for a previous version of that same block. Either way, a read into the past is performed the same way as a read of the latest version. This allows free time travel into the history of the data without any performance penalty - we call this BackDating™.
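The general idea of time-indexed write metadata can be sketched with a few lines of Python. This is a conceptual illustration of the technique the paragraph describes - a per-offset, timestamp-sorted version list queried by binary search - and the names and structure are ours, not Reduxio’s actual on-disk format.

```python
# Sketch of time-indexed write metadata: every write records a timestamp,
# and a read can target the present or any point in the past at equal cost.
import bisect

class TimeIndexedVolume:
    def __init__(self):
        # offset -> list of (timestamp, data); writes are assumed to
        # arrive in timestamp order, so each list stays sorted.
        self.index = {}

    def write(self, offset, data, ts):
        self.index.setdefault(offset, []).append((ts, data))

    def read(self, offset, as_of=None):
        versions = self.index.get(offset, [])
        if not versions:
            return None
        if as_of is None:
            return versions[-1][1]              # latest version
        # Binary search for the newest write at or before `as_of`:
        # a read into the past costs the same as a read of the present.
        times = [t for t, _ in versions]
        i = bisect.bisect_right(times, as_of)
        return versions[i - 1][1] if i else None

vol = TimeIndexedVolume()
vol.write(0, "v1", ts=100)
vol.write(0, "v2", ts=200)
print(vol.read(0))             # v2  (current data)
print(vol.read(0, as_of=150))  # v1  (a recovery point between the writes)
```

Because the timestamp is just another key in the lookup, every instant between the two writes is a usable recovery point - there is no separate “snapshot object” to create or schedule.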
How does BackDating™ overcome the limitations of snapshots?
Backdating is not a different way to do snapshots. It’s a technology that was built from the ground up to address the needs of today’s storage managers that snapshots just can’t meet. These are the advantages of using backdating:
- Data is continuously protected without time gaps - which are inherent to the usage of snapshots.
- Does not require a schedule, whereas snapshots imply cumbersome schedule management.
- Works out of the box without the need for configuration.
- Allows the user to have millions of recovery points without the huge amount of space that would be required to do the same with snapshots.
- Automatic consistency: Volumes are consistent with each other for recovery points at the same time across volumes.
- No dependencies between volume clones and the original volumes.
- History data is deduped together with current data, reducing the overall cost of keeping history.
BackDating was created to be a simple yet powerful and reliable solution for storage backups. As IT activities become more complex, there is an opportunity to drastically reduce the complexity of critical tasks, allowing managers and administrators to focus on what really matters.