Failover And Autoscaling In Microsoft Azure: An Overview

This is a companion post for my Maine Server Side And Performance Meetup talk, “Microsoft Azure Autoscaling and failover with Doug Vanderweide” on June 15, 2015.

At its heart, cloud computing implies resilience and elasticity: The ability not only to automatically survive failures, but to automatically scale to a workload.

In Azure, Microsoft’s cloud computing offering, failover and scalability come in two flavors: compute and storage. While those two cloud service aspects handle failover and scaling differently, generally speaking Azure is both highly elastic and highly resilient.

That is, provided you make the things you put on Azure amenable to elasticity and resilience.

The ability of any cloud-based solution to adapt to its operating conditions depends entirely on its architecture, and that means thinking beyond the traditional monolith and, instead, in terms of “microservices.”

Microservices pattern. Via, used by permission.

A Quick-And-Dirty Overview Of Microservices

What are microservices? Let’s ask Wikipedia:

In computing, microservices is a software architecture style, in which complex applications are composed of small, independent processes communicating with each other using language-agnostic APIs. These services are small, highly decoupled and focus on doing a small task.

This is closely related to what is commonly called “service-oriented architecture,” or SOA. And it should be familiar to anyone who has worked with .NET or the old COM+ style of Windows programming.

You can think of a microservice as being akin to a NuGet package or dynamically linked library (DLL). As with code libraries, a microservice has, as part of its charm, interoperability; we can reuse the same microservice to provide data to many different endpoints / solutions.

Another similarity is, provided we have correctly interfaced with either a DLL or a microservice, we can make internal changes to it, without affecting the code dependencies of the other services that call upon it.

For example, when designing an authentication DLL, as long as our login and logout routines expose the same interfaced object and methods to callers, we can change how we go about authenticating people without harming any external code. The same idea holds true for a microservice: As long as our endpoints remain the same, how the microservice goes about a task is of no concern to other microservices.
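Here’s a minimal Python sketch of that idea. Everything in it is hypothetical and made up for illustration: the `Authenticator` contract, the two implementations and the `greet` caller. The point is only that the caller keeps working when the implementation behind `login` is swapped.

```python
import hashlib
import hmac
from typing import Protocol


class Authenticator(Protocol):
    """The contract callers depend on (hypothetical, for illustration)."""

    def login(self, user: str, password: str) -> bool: ...


class PlaintextAuthenticator:
    """First implementation: compares stored plaintext passwords."""

    def __init__(self, users: dict[str, str]) -> None:
        self._users = users

    def login(self, user: str, password: str) -> bool:
        return self._users.get(user) == password


class HashedAuthenticator:
    """Replacement implementation: stores SHA-256 digests instead.

    Unsalted SHA-256 is used only to keep the sketch short; it is not
    a recommendation for real password storage.
    """

    def __init__(self, users: dict[str, str]) -> None:
        self._digests = {u: self._digest(p) for u, p in users.items()}

    @staticmethod
    def _digest(password: str) -> str:
        return hashlib.sha256(password.encode("utf-8")).hexdigest()

    def login(self, user: str, password: str) -> bool:
        expected = self._digests.get(user)
        if expected is None:
            return False
        return hmac.compare_digest(expected, self._digest(password))


def greet(auth: Authenticator, user: str, password: str) -> str:
    """Caller code: it only knows about login(), not how login() works."""
    return f"Welcome, {user}" if auth.login(user, password) else "Access denied"
```

Passing either implementation to `greet` gives the same answers, which is exactly the property a microservice endpoint needs to preserve.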

Unlike DLLs or NuGet packages, microservices are truly in their own processing containers: That is, they’re effectively standalone programs that can handle I/O from anyplace.

Also, microservices tend to be very narrowly focused. An old COM+ DLL might contain thousands of lines of code to provide all kinds of classes that cover every aspect of a programming need, such as data interchange; in a microservices pattern, one would break that up into several smaller services, each of which can be called as needed.

What the microservices pattern does, at the end of the day, is increase the complexity of a solution, and require a significant amount of cross-communication and coordination between its constituent parts.

This is generally handled, in Azure, via the use of HTTP-based APIs and the Service Bus, which is a type of messaging queue.

For example, if we have a website (itself, a microservice) that needs to call some JSON to populate a graphic, we might have our site call directly to an HTTP endpoint to retrieve that JSON from another microservice.

Or, if we were accepting image uploads and wanted to resize them, we might have our website save the image on cloud-based storage, then send a message to a worker process that does the resizing. That process would do the computationally expensive resizing work, then send a message back to the website, letting it know the work is complete.
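The upload-and-resize flow can be sketched in a few lines of Python. This is a local simulation under loud assumptions: a dictionary stands in for cloud storage, two in-process queues stand in for Service Bus messaging, and “resizing” just truncates the bytes. None of these names come from an Azure SDK.

```python
import queue
import threading

# In-memory stand-ins (assumptions) for cloud storage and two message queues.
storage: dict[str, bytes] = {}
work_queue: queue.Queue = queue.Queue()
done_queue: queue.Queue = queue.Queue()


def upload(name: str, data: bytes) -> None:
    """Website side: save the image, then ask a worker to resize it."""
    storage[name] = data
    work_queue.put(name)


def resize_worker() -> None:
    """Worker side: pull names off the queue until a None sentinel arrives."""
    while True:
        name = work_queue.get()
        if name is None:
            break
        # Placeholder for real image resizing: keep the first half of the bytes.
        original = storage[name]
        storage[f"thumb-{name}"] = original[: len(original) // 2]
        done_queue.put(name)  # tell the website the work is complete


def run_demo() -> list[str]:
    """Upload two images and wait for both completion messages."""
    worker = threading.Thread(target=resize_worker)
    worker.start()
    upload("cat.png", b"12345678")
    upload("dog.png", b"abcd")
    finished = [done_queue.get(timeout=5) for _ in range(2)]
    work_queue.put(None)  # shut the worker down
    worker.join()
    return finished
```

The website never waits on the expensive work itself; it only writes to storage, posts a message and later reads a completion message.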

Microservices deployment. Via, used by permission.

Cloud Services

When we follow the microservices pattern, Azure computing services are easy to scale and to fail over. This is true of both Infrastructure as a Service (IaaS), which basically means virtual machines, and Platform as a Service (PaaS), which basically means anonymous operating system instances that host a specific code base, called “worker roles.”

Whether it’s IaaS or PaaS, however, autoscaling and failover are based on the premise of quantity, not quality.

In other words, Azure autoscaling and failover are based on the premise that we use more or fewer worker processes, based on need; not bigger or smaller machines. So it’s not a case of going from a 2-core, 4 GB RAM machine to a 4-core, 8 GB RAM machine if demand spikes; instead, we would spin up a second 2-core, 4 GB RAM machine to help out with the increased workload, then deallocate it when it’s no longer needed.
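As a rough illustration of “quantity, not quality,” here is a small sketch that works out how many identical machines a workload needs. The function name and the idea of a fixed per-instance capacity are my own simplifying assumptions, not anything Azure exposes.

```python
import math


def instances_needed(requests_per_second: float,
                     capacity_per_instance: float,
                     minimum: int = 1) -> int:
    """Return how many same-sized instances are needed for a given load.

    capacity_per_instance is an assumed, measured figure: the load one
    instance can handle on its own.
    """
    if capacity_per_instance <= 0:
        raise ValueError("capacity_per_instance must be positive")
    wanted = math.ceil(requests_per_second / capacity_per_instance)
    return max(minimum, wanted)
```

If one 2-core machine handles 100 requests per second, a spike to 150 means `instances_needed(150, 100)` machines of the same size, not one bigger machine.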

The way this is accomplished is through the use of “cloud services.”

In Azure, a cloud service is a grouping of computing resources — virtual machines or worker roles — which are addressed through a single domain name, and which share a set of TCP / UDP ports. A cloud service effectively acts as a router and load balancer, although it’s technically just an addressing container.

A basic Azure IaaS cloud service, configured for high availability. All machines are in the same “availability set,” and at least two are running at all times. All three machines communicate to outside resources via the cloud app container.

To ensure high availability, we assign multiple instances of the same computing resource to a single cloud service. That cloud service, in turn, monitors all the computing resources that are within it.

All compute instances within a given cloud service share the same set of TCP / UDP endpoints.

A cloud service will “round robin” requests on the same port. For example, if we want to create a high-availability IIS server via virtual machines, we would create a public port of 80, which would be shared by all the machines in the cloud service. Which machine handles a specific request on port 80 would be determined by the cloud service, based on the workload each machine that is listening on that port faces at the time of the request.
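For reference, plain round-robin distribution looks like the sketch below. It is a toy model with hypothetical names, it ignores per-machine workload entirely, and it is not a description of how Azure’s load balancer is implemented.

```python
import itertools


class RoundRobinBalancer:
    """Hand each request to the next machine in a fixed rotation."""

    def __init__(self, machines: list[str]) -> None:
        if not machines:
            raise ValueError("at least one machine is required")
        self._rotation = itertools.cycle(list(machines))

    def route(self, request: object) -> str:
        """Return the machine that should handle this request."""
        return next(self._rotation)
```

With three machines behind port 80, five consecutive requests would go to the first, second, third, first and second machine, in that order.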

In some cases, we can assign specific endpoints to certain machines; however, if we have machine-specific endpoints, we can never be sure that the machine which has those endpoints is listening.

In some cases, this is fine; consider an IaaS instance running as an FTP server. We would share port 21, the FTP control port, between machines. But for PASV FTP communication, we would specify machine-specific ports. Because the cloud service tells the client, after connection, to send data on, say, port 10035, which belongs only to a specific FTP server in the cloud service, the data would go to the only machine listening on that port.

A slide from “High available BizTalk infrastructure on Azure IaaS” on

VM Availability Sets

Additionally, we want virtual machines, within a cloud service, to be within the same “availability set.”

In Azure, an availability set ensures physical separation of computing resources, so that a single power or network fault cannot take down every machine in the set; at least one other machine keeps running. Machines in an availability set also will not update their OS at the same time.

Therefore, each availability set should have at least two virtual machines in it, and each IaaS cloud service should consist of at least one availability set, if we expect the cloud service to be resilient.
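To see why two machines is the minimum, here is a toy placement model. Azure calls the physically separate groups “fault domains”; the count of two used here, the function names and the alternating placement are assumptions for illustration only.

```python
def place(vms: list[str], domain_count: int = 2) -> dict[str, int]:
    """Spread VMs across fault domains by alternating between them."""
    return {vm: index % domain_count for index, vm in enumerate(vms)}


def survivors(placement: dict[str, int], failed_domain: int) -> list[str]:
    """Return the VMs still running after one fault domain goes down."""
    return [vm for vm, domain in placement.items() if domain != failed_domain]
```

With three VMs, losing either domain still leaves something running. With a single VM, losing its domain leaves nothing, which is why a one-machine availability set buys no resilience.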

Traffic Manager based failover load balancing, from

Compute Failover

Now we can finally talk about how, specifically, Azure handles failover for compute instances.

In the case of an IaaS cloud service, a bare-minimum failover configuration would be two virtual machines, running in the same availability set. These machines would need to be identical in terms of configuration and software they are running.

If VM1 were to fail, the cloud service would detect this failure and automatically route all traffic to VM2.

  • If VM1 has machine-specific ports, calls to those ports would be rejected until VM1 is recovered and restarted.
  • Azure will attempt to recover the failed machine and automatically restart it.
  • Azure also will monitor your VMs as they run, and attempt to “heal” any troubled machine before it crashes.
  • If the machine can’t be recovered — such as after suffering a catastrophic disk failure — you can program a diagnostic solution to alert you about the failure.
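The routing behavior in that list can be modeled as follows. This is a simplified sketch, not Azure’s implementation: the `CloudService` class, its method names and the use of `ConnectionRefusedError` for a rejected call are all assumptions.

```python
class CloudService:
    """Toy model of shared and machine-specific endpoints with failover."""

    def __init__(self) -> None:
        self.healthy: dict[str, bool] = {}
        self.shared_ports: dict[int, list[str]] = {}
        self.dedicated_ports: dict[int, str] = {}
        self._next: dict[int, int] = {}

    def add_machine(self, name: str) -> None:
        self.healthy[name] = True

    def share_port(self, port: int, machines: list[str]) -> None:
        self.shared_ports[port] = list(machines)
        self._next[port] = 0

    def dedicate_port(self, port: int, machine: str) -> None:
        self.dedicated_ports[port] = machine

    def route(self, port: int) -> str:
        """Pick a machine for a request, skipping machines marked unhealthy."""
        if port in self.dedicated_ports:
            machine = self.dedicated_ports[port]
            if not self.healthy[machine]:
                # Machine-specific port: nobody else can answer for it.
                raise ConnectionRefusedError(f"{machine} is down")
            return machine
        machines = self.shared_ports.get(port, [])
        for _ in range(len(machines)):
            index = self._next[port] % len(machines)
            self._next[port] = index + 1
            if self.healthy[machines[index]]:
                return machines[index]
        raise ConnectionRefusedError(f"no healthy machine on port {port}")
```

Marking VM1 unhealthy sends every port 80 request to VM2, while calls to VM1’s dedicated port fail until VM1 is marked healthy again.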

A best practice here is, once you get a specific VM configured to do what you need, capture that VM’s hard drive — which is basically a file kept in storage — and use that image to create additional machines.

This allows you to not only quickly create new VMs within the same cloud service and availability set, but to quickly recover from a lost disk.

Azure offers a VM backup service, as well; and there are third-party solutions that enhance and extend recovery options for lost VMs, including in real-time.

Failover is more straightforward in a PaaS solution. In that case, you basically scale out to at least two instances within the same cloud service, and tell Azure you always want at least two instances running.

If a specific instance fails, Azure will take your worker role code, clone it to a new instance, and put it into the affected cloud service, all the while routing incoming requests to the instance that has not crashed.

What about those cases where your solution needs to survive a datacenter outage? In that case, you’d want to basically replicate your compute solution to a second data center, then use Traffic Manager to handle routing.

Traffic Manager is a DNS-based routing solution in Azure, which allows you to send requests to different cloud services on the basis of performance (the endpoint with the lowest network latency for the caller), round-robin (the next endpoint of n endpoints) or failover (the first healthy endpoint in a priority list).

Using Traffic Manager with at least two cloud service-based workflow entry points, we can have a very high certainty that no matter what, our job will be processed.
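The three routing methods can be compared side by side in a short sketch. This is not Traffic Manager’s code or API; the `Endpoint` type, the function names and the single `metric` number (lower is better) are assumptions chosen to keep the example small.

```python
from dataclasses import dataclass


@dataclass
class Endpoint:
    name: str
    healthy: bool = True
    metric: float = 0.0  # assumed score for performance routing; lower is better


def pick_performance(endpoints: list[Endpoint]) -> str:
    """Performance: choose the healthy endpoint with the best (lowest) metric."""
    live = [e for e in endpoints if e.healthy]
    if not live:
        raise LookupError("no healthy endpoint")
    return min(live, key=lambda e: e.metric).name


def pick_failover(endpoints: list[Endpoint]) -> str:
    """Failover: choose the first healthy endpoint in list order."""
    for endpoint in endpoints:
        if endpoint.healthy:
            return endpoint.name
    raise LookupError("no healthy endpoint")


def round_robin(endpoints: list[Endpoint]):
    """Round-robin: yield healthy endpoints in rotation, skipping failed ones."""
    position = 0
    while any(e.healthy for e in endpoints):
        endpoint = endpoints[position % len(endpoints)]
        position += 1
        if endpoint.healthy:
            yield endpoint.name
```

In all three cases an unhealthy endpoint is simply skipped, which is what lets a second datacenter pick up the work when the first one goes dark.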

Azure management portal pane showing configuration settings to autoscale a worker role-based cloud service by CPU demand, via

Compute Autoscale

Azure handles autoscaling of resources pretty much the way it handles failover: It invokes additional instances, based on the maximum number of instances you are willing to deploy.

Autoscaling can take place under three circumstances:

  • Schedule: At a given time, you can spin up additional resources, then spin them down at some other time. For example, suppose you want to do some data warehousing in the wee small hours, every day. You would spin up additional SQL Server VMs that handle your warehousing, go ahead and replicate your data, then spin those machines down at some hour by which the job is complete.
  • CPU: You can specify additional instances to come online if CPU use for the entire cloud service reaches some percentage, then scale back down when demand is lessened. For example, you could tell Azure to start an additional two VMs if CPU demand reaches 80 percent, and to spin those machines back down once demand regresses to 40 percent.
  • Queue backlog: If your cloud service is listening to a messaging queue, you can instruct Azure to spin up additional instances once there are a certain number of backlog messages; the extra instances will spin down once the backlog is cleared.
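Those three triggers boil down to simple decision rules, sketched below. The 80 and 40 percent thresholds and the step of two instances come from the CPU example above; the function names, the minimum and maximum instance counts, the per-instance message figure and the schedule hours are assumptions for illustration.

```python
import math


def scale_by_schedule(hour: int, *, start: int = 1, end: int = 5,
                      baseline: int = 2, extra: int = 2) -> int:
    """Schedule rule: run extra instances between start and end (24-hour clock)."""
    return baseline + extra if start <= hour < end else baseline


def scale_by_cpu(current: int, cpu_percent: float, *,
                 scale_out_at: float = 80, scale_in_at: float = 40,
                 step: int = 2, minimum: int = 2, maximum: int = 6) -> int:
    """CPU rule: add instances at the high mark, remove them at the low mark."""
    if cpu_percent >= scale_out_at:
        return min(maximum, current + step)
    if cpu_percent <= scale_in_at:
        return max(minimum, current - step)
    return current  # between the thresholds: leave things alone


def scale_by_backlog(backlog: int, *, messages_per_instance: int = 100,
                     minimum: int = 2, maximum: int = 6) -> int:
    """Queue rule: size the pool to the backlog, within the allowed range."""
    wanted = math.ceil(backlog / messages_per_instance)
    return max(minimum, min(maximum, wanted))
```

The gap between the scale-out and scale-in thresholds matters: without it, a service hovering around one number would add and remove machines constantly.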

This methodology can make it difficult to determine the exact cost of your Azure computing resources, since you are charged by the hour, and based upon the CPU count and memory size of each compute instance. When instances come online, and how long they stay up, is automated; and there’s no way to tell Azure to not spend more than X number of dollars per billing cycle.

Correction, 18 June 2015: As Shawn Michael Campbell pointed out to me during the Meetup, you can set a monthly spending cap on Azure.

However, exceeding that cap will cause Azure to spin down all compute resources and put your storage into read-only mode.

Since this is probably an entirely unacceptable outcome of a spending cap … for all practical intents and purposes, you cannot cap your monthly Azure spending.

So if predictable pricing is important to you, you’d probably engage in a cloud services anti-pattern and make a single instance of your solution powerful enough to handle almost all traffic thrown at it.

Again, that’s a huge code smell. Accept that your Azure bill will be somewhat unpredictable, but that you will probably be able to compute a “worst-case” cost scenario, and go with that.
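A worst-case figure is just arithmetic: the most instances you allow, running for every hour of the longest month. The sketch below works in whole cents to avoid floating-point rounding. The hourly rate is a made-up number, not an Azure price, and the function name is hypothetical.

```python
def monthly_cost_range_cents(min_instances: int, max_instances: int,
                             hourly_rate_cents: int,
                             hours_in_month: int = 744) -> tuple[int, int]:
    """Return (best case, worst case) monthly compute cost in cents.

    744 hours is a 31-day month. The best case assumes only the minimum
    instance count ever runs; the worst case assumes the maximum runs
    around the clock.
    """
    if min_instances > max_instances:
        raise ValueError("min_instances cannot exceed max_instances")
    best = min_instances * hourly_rate_cents * hours_in_month
    worst = max_instances * hourly_rate_cents * hours_in_month
    return best, worst
```

For example, with two to six instances at an assumed 10 cents an hour, the bill lands somewhere between $148.80 and $446.40 for the month.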

Storage Failover

Azure handles failover in its storage services — files, NoSQL-like data stores and SQL Server — through redundant copies.

Let’s talk about file / NoSQL recovery first.

Azure has three levels of redundancy for such files:

  • Locally redundant: Three copies of the same container are kept in the same datacenter.
  • Geo-redundant: In addition to the local copies, three copies of each container are also kept in a different, predetermined datacenter. For example, US East datacenter files (Virginia) are georeplicated to US West (San Francisco).
  • Geo-redundant read-only: This option provides a “backup” location that allows for immediate reading of storage files. Although you would need to write code that accesses this backup location, in the event of a datacenter failure, it does allow you to work around, to some degree, a datacenter failure.

So generally speaking, absent a complete failure of a datacenter’s storage service tier, you should find that local redundancy is adequate for continuous operation; if Azure can’t read from your primary storage volume, it will attempt to use one of the replicas, and automatically replace the damaged volume.

Geo-replication will basically assure you won’t lose your data in the event of a major catastrophe; and geo-redundant read-only replication should ensure some level of business continuity.

If you need high-availability read-write to storage that can survive a datacenter outage transparently, you would want to build your solution to read and write from storage in two datacenters, probably using a transactional model. That is, you would commit all writes to your primary store, then commit a second write to the backup store; reading would be in the same order.
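Here is a bare-bones sketch of that write-twice, read-in-order approach. The `Store` and `DualStore` classes are in-memory stand-ins I made up; a real solution would wrap actual storage clients and would also need retries or a repair step for the case where the second write fails after the first succeeds.

```python
class Store:
    """In-memory stand-in (an assumption) for storage in one datacenter."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.available = True
        self._blobs: dict[str, bytes] = {}

    def write(self, key: str, value: bytes) -> None:
        if not self.available:
            raise ConnectionError(f"{self.name} is unavailable")
        self._blobs[key] = value

    def read(self, key: str) -> bytes:
        if not self.available:
            raise ConnectionError(f"{self.name} is unavailable")
        return self._blobs[key]


class DualStore:
    """Write to the primary, then the backup; read in the same order."""

    def __init__(self, primary: Store, secondary: Store) -> None:
        self.primary = primary
        self.secondary = secondary

    def write(self, key: str, value: bytes) -> None:
        # Any failure propagates to the caller; this sketch does no rollback.
        self.primary.write(key, value)
        self.secondary.write(key, value)

    def read(self, key: str) -> bytes:
        try:
            return self.primary.read(key)
        except ConnectionError:
            return self.secondary.read(key)
```

When the primary datacenter is unreachable, reads fall through to the backup without the caller noticing. Writes are the hard part, since this sketch simply fails them.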

Azure Data Sync example use. The orange servers are spokes, which write to a hub, in green; which in turn is a witness server that replicates to the non-writing DBs. Via

SQL Server Failover

Note that storage failover is not ideal: you can lose I/O operations to a storage failure, especially if you are not writing to storage in two different datacenters.

This is even more pronounced in Azure’s SQL Server offering.

As with regular storage, there are multiple SQL Server database copies in a given instance in each datacenter, each of which is monitored, “failed over” and rebuilt automatically in the event of an error.

However, in the event of a datacenter service outage, maintaining a synced copy of SQL Server can be a challenge.

Azure does offer a syncing service for data that spans regions. However, this service is expensive — it requires at least three databases, one of which is a witness DB — and each DB in the sync group can be out of sync for as long as 5 minutes.

If you’re doing thousands of writes per minute, 5 minutes might as well be 5 million years, of course.
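To put a number on that, multiply the write rate by the sync lag. The function below is a hypothetical helper, and the 2,000 writes per minute in the example that follows it is an invented figure standing in for “thousands.”

```python
def writes_at_risk(writes_per_minute: int, sync_lag_minutes: int = 5) -> int:
    """Return how many writes may not have reached the other databases yet."""
    return writes_per_minute * sync_lag_minutes
```

At 2,000 writes per minute and the full 5-minute lag, `writes_at_risk(2000)` comes to 10,000 writes that a failover could leave behind.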

So if you have mission-critical SQL Server data that must be highly available and write-intensive, it makes far more sense to spin up several SQL Server VMs and to use the AlwaysOn feature that’s built into the standard SQL Server install.

Final Notes

Azure is an agile technology; new features are released every few weeks. So if you’re reading this any significant period of time after it was published, there’s a good chance this information is no longer correct, has been superseded or is otherwise not appropriate to follow.

Keep that in mind.

Also, you’ll notice I’ve not said anything about autoscaling for storage. That’s built-in, if you will. While there are theoretical limits to the amount of data you can store in Azure, it’s practically limitless, and it will simply grow to whatever you need it to be. I/O throughput is uniform, regardless of the endpoint to which you are sending the data.

Of course, if you are communicating externally to a distant datacenter, or your storage account is not directly coupled to your solution through an affinity group (the old way of coupling services) or regional virtual network (the new way of coupling services), you can see delays in network operations.


