In the last chapter, we started the operations and monitoring topic by looking at logging and monitoring. In this chapter, we’ll explore another important subject – the ability to recover from a complete outage.
Azure provides a range of recovery solutions with different features: traditional backup and restore functionality for VMs and the data on them, Azure Site Recovery for continually replicating VMs, and the backup facilities of Azure SQL and Azure Cosmos DB.
Finally, we will look at a related feature – the ability to move infrequently accessed data to cheaper storage for long-term retention.
By the end of this chapter, you will understand how to choose between the different backup and recovery solutions depending on your organization’s needs, taking into account how quickly services must be restored and how much data loss is, or isn’t, acceptable.
With this in mind, we will be covering the following subjects:
- Understanding recovery solutions
- Planning for Azure Backup
- Planning for Site Recovery
- Planning for database backups
- Understanding the data archiving options
Technical requirements
This chapter will use the Azure portal (https://portal.azure.com) for examples.
Understanding recovery solutions
Throughout this book, we have looked at how to architect solutions that are resilient to failures or outages. Often this has involved duplicating components such as VMs, web apps, and even databases. Sometimes we have duplicated systems within a region to protect against hardware or individual data center failures, or cross-region to protect against entire region outages.
However, this comes at a financial cost – doubling up on a database or VM means doubling the costs as well.
Sometimes the cost of an outage outweighs the cost of duplicate components – if an application is used continually and is revenue-generating, one hour of downtime could cost millions, and therefore the increased cost of another database is negligible.
Not all systems are this sensitive. For lower-budget solutions, or systems that are not as critical, potential downtime may be preferable to an increase in hosting costs. Therefore, when architecting solutions, we must understand the business needs. Often, the amount of data loss and downtime an application can withstand is expressed in terms of a Recovery Point Objective (RPO) and a Recovery Time Objective (RTO), respectively.
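The cost trade-off described above can be sketched as a simple comparison. The following Python snippet is only an illustration: the function name, its parameters, and all of the figures are hypothetical, and a real assessment would involve many more factors than these three numbers.

```python
def redundancy_pays_off(outage_cost_per_hour: float,
                        expected_outage_hours_per_year: float,
                        redundancy_cost_per_year: float) -> bool:
    """Return True when the expected yearly loss from outages is larger
    than the yearly cost of running a duplicate component."""
    expected_outage_loss = outage_cost_per_hour * expected_outage_hours_per_year
    return expected_outage_loss > redundancy_cost_per_year


# Hypothetical revenue-generating application: one hour of downtime costs
# 1,000,000 and a second database costs 60,000 per year.
critical_app = redundancy_pays_off(1_000_000, 1, 60_000)

# Hypothetical internal tool: downtime costs 200 per hour, four hours of
# outage are expected per year, and duplication costs 6,000 per year.
internal_tool = redundancy_pays_off(200, 4, 6_000)

print(critical_app, internal_tool)  # True False
```

With these assumed figures, duplication is easily justified for the first application, while for the second the extra hosting cost is larger than the expected loss.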
Understanding the Recovery Time Objective (RTO)
The RTO defines how quickly you must be able to recover from an outage. If your RTO is at or close to 0, then building the highly available solutions we have looked at so far is the best course of action. If, however, your RTO is 24 hours or more, then a traditional backup and restore design may be more suitable and cost-effective.
An RTO of 24 hours means you have a full day to rebuild or restore your solution from backup, which, depending on the backup solution you choose and the size of your application, may be adequate.
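As a rough illustration of this check, the sketch below compares an estimated restore time against the RTO. The function name and the restore estimates are assumptions made for the example; in practice, the estimate would come from testing a restore of your own solution.

```python
from datetime import timedelta


def restore_meets_rto(estimated_restore_time: timedelta, rto: timedelta) -> bool:
    """Return True when a restore from backup is expected to finish within the RTO."""
    return estimated_restore_time <= rto


rto = timedelta(hours=24)

# Hypothetical small application that can be restored in about 6 hours.
print(restore_meets_rto(timedelta(hours=6), rto))   # True

# Hypothetical large application whose restore takes about 30 hours.
print(restore_meets_rto(timedelta(hours=30), rto))  # False
```

If the estimate exceeds the RTO, backup and restore alone is unlikely to be adequate, and a design with less recovery time is worth considering.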