Cloud backup providers aren’t infallible. Be sure to ask hard questions of providers about their storage redundancy, geo-replication, data integrity measures, and disaster recovery capabilities.
In the world of data protection, the cloud has emerged as a popular destination for backups. The appeal is obvious: offload the complexities and costs of backup infrastructure to a provider, and rest easy knowing that at least one copy of your backup data is automatically stored off-site. Many backup providers also offer immutable backups, meaning that hackers can’t delete, encrypt, or corrupt your backups in any way, even if they gain administrative control over your backup system.
As popular as cloud backups are these days, it is important to understand that they are not without risk. Nothing illustrates this point better than the stories of Carbonite and StorageCraft, two cloud backup vendors that lost some of their customers’ backup data. Their experiences share some concerning parallels.
In 2009, Carbonite suffered a multi-disk failure of its backup storage arrays, resulting in a complete loss of the most recent backups for 7,500 customers.
In 2014, StorageCraft experienced a very different event with a very similar outcome. Its data loss was caused by human error during a cloud migration: an admin decommissioned and deleted a server before it had been fully migrated to the cloud. This error led to lost metadata and an urgent scramble to help customers re-seed their backups. While the underlying technical details differ, the core issue in both cases was the same: inadequate redundancy and resiliency measures for storing customer backup data.
Carbonite relied on individual RAID arrays using only RAID5, leaving them vulnerable to several failure scenarios. The first, and the one they actually experienced, is the most common: RAID5 survives only a single drive failure, so losing a second drive before the first has been rebuilt takes out the entire array. Given the number of drives in such an array, their size (drives are much larger than they used to be), and the fact that, well, drives fail, RAID5 has not been a recommended configuration for a long time. And yet that is the level they were using.
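To see why, it helps to run the numbers on a rebuild. The sketch below is a back-of-the-envelope estimate only; the drive count, capacity, and unrecoverable read error (URE) rate are assumed values for illustration, not Carbonite’s actual configuration.

```python
# Back-of-the-envelope odds that a RAID5 rebuild hits an unrecoverable
# read error (URE) before it completes. All figures are illustrative
# assumptions, not any vendor's actual configuration.
import math

drives = 8          # assumed number of drives in the array
capacity_tb = 10    # assumed capacity per drive, in terabytes
ure_rate = 1e-14    # a common drive spec: one URE per 1e14 bits read

# A RAID5 rebuild must successfully read every bit on every surviving drive.
bits_to_read = (drives - 1) * capacity_tb * 1e12 * 8

# Probability of at least one URE during the rebuild (Poisson approximation).
p_rebuild_fails = 1 - math.exp(-ure_rate * bits_to_read)
print(f"Chance the rebuild hits a URE: {p_rebuild_fails:.1%}")  # ~99.6%
```

With these assumptions, the rebuild is more likely to fail than to finish, which is why RAID6 and erasure coding displaced RAID5 for large arrays long ago.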
Data from 7,500 customers was residing on an array that was essentially a ticking time bomb. This also means that a flood, electrical short, or fire could have taken out their customers’ data just as easily. We shouldn’t forget that fires hit cloud providers too, as we saw when OVH’s entire Strasbourg data center went up in smoke in 2021. Vendors holding other companies’ data should be using geo-redundant storage, such as object storage services like Amazon S3, Azure Blob Storage, or Google Cloud Storage.
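For a sense of what geo-redundancy looks like in practice, here is a minimal sketch that enables cross-region replication on an S3 backup bucket with boto3. The bucket names and IAM role are hypothetical placeholders; a real deployment would also need versioning enabled on the destination bucket and a role with replication permissions.

```python
# Minimal sketch: replicate a backup bucket's objects to a second AWS
# region. Bucket names and the role ARN are hypothetical placeholders.
import boto3

s3 = boto3.client("s3")
SOURCE = "example-backups-us-east-1"
DEST_ARN = "arn:aws:s3:::example-backups-us-west-2"
ROLE_ARN = "arn:aws:iam::123456789012:role/example-s3-replication"

# Replication requires versioning on both buckets; the destination
# bucket needs the same call made in its own region.
s3.put_bucket_versioning(
    Bucket=SOURCE,
    VersioningConfiguration={"Status": "Enabled"},
)

# Replicate every new object version to the second region.
s3.put_bucket_replication(
    Bucket=SOURCE,
    ReplicationConfiguration={
        "Role": ROLE_ARN,
        "Rules": [{
            "ID": "replicate-all-backups",
            "Status": "Enabled",
            "Priority": 1,
            "Filter": {},
            "DeleteMarkerReplication": {"Status": "Disabled"},
            "Destination": {"Bucket": DEST_ARN},
        }],
    },
)
```

With a configuration like this, a fire in one region leaves a complete, independent copy of every backup in another.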
We see a similar failure in StorageCraft’s design. The fact that one admin decommissioning a single server caused them to lose all metadata suggests a lack of geo-redundancy and fault tolerance in their backup storage architecture. They had a single server holding a single copy of the only data that would allow them to piece together all of their customers’ backups. Again, a fire, electrical short, or flood could have just as easily wiped out this company. In the end, however, it was a single human error. As a reminder: human error and malfeasance are the number one and two reasons why we back up in the first place.
As a backup professional with decades of experience, I cringe at these incidents. The 3-2-1 backup rule exists for a reason: 3 copies of your data, on 2 different media, with 1 copy off-site. Any responsible backup provider should be architecting their cloud with multiple layers of redundancy, geo-replication, and fault isolation. Anything less puts customer data at unacceptable risk. The loss of a single copy of any data in any backup environment should never result in the loss of all copies.
When something like this happens, you should also look at how the company handled it. Carbonite’s response was to sue their storage vendor, pointing fingers instead of taking accountability. They saw nothing wrong with their design; it was their storage vendor’s array that caused them to lose customer data. (The lawsuit was settled out of court with no public record of the outcome.) Carbonite’s CEO also tried to publicly downplay the incident, saying it was only backup data, not production data, that had been lost. That point was probably lost on the 54 companies that did lose production data, because they needed a restore that would have been possible only with the lost backups.
StorageCraft reacted much better. Its CEO, to his credit, issued a public mea culpa, saying he understood how critical the loss was. He also made sure the company did everything it could to help customers re-seed their backups, including shipping them drives to get the data to them faster. (Drive shipping is a very common method of making the first backup, referred to as the seed, happen much faster than it would over the internet.)
So, what lessons can we draw for businesses looking to use cloud backup services? Here are a few.
- Don’t assume cloud means infallible. Ask hard questions of providers about their storage redundancy, geo-replication, data integrity measures, and disaster recovery capabilities, and get specifics in writing in the contract and SLAs. The best answer is that they store all backups on well-vetted, geo-redundant object storage, such as S3 or Azure Blob, so that no single fire or flood can take out your backups. To ensure that no human error or malfeasance does the same, verify that they offer true immutability. If their “immutable” backups can be overwritten by a phone call or email, they’re not really immutable (see the sketch after this list).
- Understand the shared responsibility model. While the provider secures and protects the cloud infrastructure, it’s still on you to manage your backup configuration, retention periods, and restore testing. Don’t abdicate full responsibility to the provider.
- Have a contingency plan. Even with a solid cloud backup strategy, localized disasters can still threaten your on-premises data. Maintain physical backups or a secondary cloud provider to mitigate the risk of losing both production and cloud backup copies simultaneously. There are third-party backup services and software that allow you to copy data to multiple cloud providers. Look into those options and see if they are financially viable for your operation.
- Don’t blindly trust marketing statements. Dig into independent reviews, talk to existing customers, and look for any history of lawsuits or public failures. Reputable providers will be transparent about their architecture and have a track record of reliability and responsive customer service.
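To make “true immutability” concrete, here is a minimal sketch of one well-known mechanism: S3 Object Lock in COMPLIANCE mode, which no one, including the account’s root user, can shorten or remove before the retention period expires. The bucket name and retention period are assumptions for illustration; other object stores offer equivalent features.

```python
# Minimal sketch: an S3 bucket whose objects cannot be deleted or
# overwritten for 30 days, by anyone. Names are hypothetical.
import boto3

s3 = boto3.client("s3")
BUCKET = "example-immutable-backups"

# Object Lock can only be enabled when the bucket is created.
s3.create_bucket(Bucket=BUCKET, ObjectLockEnabledForBucket=True)

# Default retention: every new object version is locked for 30 days.
# COMPLIANCE mode means even privileged users cannot lift the lock.
s3.put_object_lock_configuration(
    Bucket=BUCKET,
    ObjectLockConfiguration={
        "ObjectLockEnabled": "Enabled",
        "Rule": {"DefaultRetention": {"Mode": "COMPLIANCE", "Days": 30}},
    },
)
```

The contrast with GOVERNANCE mode is the point: under GOVERNANCE, suitably privileged users can bypass the retention settings, which is exactly the “phone call or email” loophole described above.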
The cloud can be a powerful tool for backup modernization and cost efficiency. But like any infrastructure decision, it requires due diligence to separate the truly enterprise-grade providers from those cutting corners. Carbonite and StorageCraft serve as cautionary tales of what can go wrong, and their stories are a reminder to all of us in the backup community to remain vigilant in our mission to be data protection superheroes. Trust, but verify.