Knowing your backup options is vital
In one of my previous jobs we had a major issue in our development environment. The problem was causing the entire team to stop working, so picture about 50 developers sitting idle.
We had a pretty good idea where the problem was, but to verify and solve it we needed to restore data between the different storage systems we used. The amount of data itself was not that large, around 20TB at most, but the transfer could take up to 24 hours: the file system held more than 40 million files, and copying file by file added enormous per-file overhead. Since this approach was slow and impractical, we searched for other options. Eventually one of our consultants recommended copying the data using the OS-specific block-level copy. That cut copy times from 24 hours to under eight, meaning we could test three solutions a day instead of just one.
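To see why the file count, not the data volume, dominated the copy time, here is a rough back-of-the-envelope sketch. The 20TB and 40 million figures come from the story above; the 1 ms per-file cost is an assumed number purely for illustration:

```python
# Illustrative math only: the per-file overhead figure is hypothetical.
TOTAL_BYTES = 20 * 10**12      # ~20 TB of data (from the story)
FILE_COUNT = 40 * 10**6        # ~40 million files (from the story)
PER_FILE_OVERHEAD_S = 0.001    # assumed 1 ms to open/stat/close each file

overhead_hours = FILE_COUNT * PER_FILE_OVERHEAD_S / 3600
print(f"Per-file overhead alone: ~{overhead_hours:.0f} hours")
# Even before moving a single byte, file-by-file copying spends on the
# order of 11 hours just touching file metadata. A block-level copy
# skips the file system entirely and streams at device speed.
```

The exact per-file cost varies wildly by file system and hardware, but the shape of the problem is the same: metadata work scales with file count, while a block-level copy scales only with capacity.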
This short story illustrates how important it is to be familiar with all your backup options. In this post I’ll dive into five best practices for addressing backup and restore challenges when working with virtual environments.
1. Use the native infrastructure capabilities
Familiarize yourself with your infrastructure as thoroughly as you can. Each vendor, product, and technology has capabilities that are easily overlooked. Storage and virtualization vendors today offer Changed Block Tracking (CBT) for quicker incremental backups, the vSphere Storage APIs for Array Integration (VAAI) to offload data movement to the array, and more. Learn as much as you can about these technologies; once you start leveraging them, backups can be accomplished far more efficiently.
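The idea behind changed-block tracking can be shown with a toy model (this is not the vSphere API, just an illustration of the principle): a full backup is taken once, and each incremental run copies only the blocks flagged as changed since the last backup.

```python
# Toy model of changed-block tracking: copy only the blocks dirtied
# since the last backup instead of rescanning the whole disk.
def incremental_backup(disk: list[bytes], changed: set[int],
                       backup: list[bytes]) -> int:
    """Apply dirty blocks to an existing full backup; return blocks copied."""
    for block_id in changed:
        backup[block_id] = disk[block_id]
    return len(changed)

disk = [b"A", b"B", b"C", b"D"]
backup = list(disk)            # the last full backup
disk[1] = b"X"                 # guest writes after that backup...
disk[3] = b"Y"
copied = incremental_backup(disk, {1, 3}, backup)
print(copied, backup == disk)  # 2 True -- only 2 of 4 blocks moved
```

With real CBT, the hypervisor maintains the dirty-block map for you, so the backup software never has to read unchanged data at all.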
2. Avoid using VMware snapshots for data retention
A common misconception is that a VMware snapshot is like a storage snapshot; it isn't. A VMware snapshot freezes the VM's base VMDK files and redirects all new writes to a delta (redo log) file that continuously records block changes. This degrades performance because of the additional I/O the redirection imposes. In addition, since the base VMDK is no longer written to and every new write lands in the delta file, extra space is consumed, and an unmonitored snapshot can eventually fill the datastore.
When deleting a VMware snapshot, VMware needs to "commit" all the changes from the delta (redo log) file into the live virtual machine's VMDKs. This creates a second performance hit on the virtual machine: the bigger the delta file, the longer the consolidation affects the VM.
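A quick sketch makes the "fill up the datastore" risk concrete. Both numbers here are assumptions chosen for illustration, not measurements:

```python
# Hypothetical figures: how long until a snapshot's delta file exhausts
# the datastore's free space, assuming every guest write lands in it.
FREE_SPACE_GB = 500      # assumed free space left on the datastore
WRITE_RATE_MB_S = 20     # assumed sustained guest write rate

hours_to_full = FREE_SPACE_GB * 1024 / WRITE_RATE_MB_S / 3600
print(f"Datastore full in ~{hours_to_full:.1f} hours")  # ~7.1 hours
```

At even a modest write rate, a forgotten snapshot can exhaust free space in well under a day, which is why monitoring snapshot age and size matters.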
Use VM-based snapshots only for short periods, and only when you need to quiesce the data. Rely on storage snapshots for data retention and as a backup solution; those snapshots can integrate with VMware snapshots if required.
There are also technologies such as BackDating™ that provide snapshot-based backups for VMs without affecting datastore size or overall performance. BackDating can provide even better data protection than traditional storage snapshots, since it records every write to the disk and offers a 1-second RPO.
3. Examine backup schedules carefully
We all know the concept of a backup window: it exists to keep backup jobs from interfering with production work. The best way to avoid overloading production systems is to plan according to your environment. Don't back up too many VMs concurrently, as doing so puts a heavy load on both the compute and storage environments.
Similarly, you can avoid storage congestion by limiting the number of VMs you back up from the same datastore. However, if you don't back up enough VMs at the same time, your backup windows can grow too long. Use the storage system's capabilities to offload backup work from the production environment.
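The "cap concurrency per datastore" idea is a generic throttling pattern. A minimal sketch, assuming a placeholder `backup_vm` job (the names, limits, and datastores here are all hypothetical, not a real backup API):

```python
import threading
from concurrent.futures import ThreadPoolExecutor

MAX_PER_DATASTORE = 2   # assumed limit; tune for your storage's capabilities
datastore_slots = {
    "ds1": threading.Semaphore(MAX_PER_DATASTORE),
    "ds2": threading.Semaphore(MAX_PER_DATASTORE),
}

def backup_vm(vm: str, datastore: str) -> str:
    # At most MAX_PER_DATASTORE backups touch a given datastore at once,
    # so a large job queue cannot congest a single array.
    with datastore_slots[datastore]:
        return f"{vm} backed up from {datastore}"  # placeholder for real work

vms = [("vm1", "ds1"), ("vm2", "ds1"), ("vm3", "ds1"), ("vm4", "ds2")]
with ThreadPoolExecutor(max_workers=8) as pool:
    results = list(pool.map(lambda job: backup_vm(*job), vms))
print(len(results))  # 4
```

The thread pool bounds total concurrency while the per-datastore semaphores bound load on any single array; real backup software exposes equivalent knobs, so the point is to set them deliberately rather than accept defaults.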
4. Do not back up at the VM OS level
This can lead to some major issues. First, backing up at the OS level means managing the backup of each VM independently, which defeats much of the purpose of using VMs in the first place. Second, as the story above illustrates, you become subject to the type of data that resides on the VM itself: if the VM holds millions of files, the backup must walk every one of them instead of simply capturing the VM's disks at that point in time.
In this case, it’s better to go with a storage-level backup rather than one run inside the VM’s OS. This is especially true if the vendor provides some form of deduplication. Look for inline and global deduplication, as these greatly enhance backup performance and optimize the use of your storage media.
5. Enforce quiescing and VSS
Crash consistency used to be a big concern. Nowadays, most applications can recover from a crash-consistent state fairly easily. For example, Oracle’s Storage Snapshot Optimization uses storage-side snapshots of the database to capture data changes without putting the database in hot backup mode.
However, to ensure everything works as expected, quiescing should be enforced. The best way to do that is to use the VMware Tools quiesce option, which flushes and pauses I/O inside the guest (via VSS on Windows) so the snapshot captures a consistent point in time.
There are new (and better) ways to approach backups
These best practices can improve what you are doing today with your existing VM infrastructure, but there are new technologies available to help you back up and restore your data even better. For example, a simpler approach to VM backups is a flash storage platform with advanced data protection capabilities built into it.
A good place to start learning about these technologies is this recent Forrester report, which highlights the complex data protection challenges customers face and how three new innovators are changing the game for enterprise data protection.