Handling Transient Failures in Azure Using Polly

Transient failures are one of the major challenges in running a scalable, performant application in any cloud environment. At the end of the day, it does not matter who your cloud provider is: your application is going to encounter transient failures. A successful architecture ensures the implementation has ways to handle these failures with minimal disruption to its users.

Transient failures come from many different sources. The following are a few of the key ones:

  • Loss of connection to a Database
  • Loss of connection to a cache storage like Redis cache
  • Operation timeout for services like microservices, database, Redis etc.
  • Loss of connectivity with Message Bus services

Other transient failures can occur in an application, but the list above covers the ones you are most likely to encounter.

For any cloud-hosted application, there is no way to avoid these issues entirely. Often these errors occur because urgent infrastructure maintenance is taking place that has an indirect impact on the stability of certain services your application depends on. A cloud vendor cannot fit the schedule of every client to make sure every outage is a planned one, so you just have to deal with it.

Our cloud-hosted solutions are no exception. One of the most common errors I encounter is request timeouts against Redis cache servers. If you are using Redis as your distributed cache provider, you may see errors like the following in your logs.

StackExchange.Redis.RedisTimeoutException: Timeout awaiting response 
 (outbound=371KiB, inbound=0KiB, 5969ms elapsed, timeout is 5000ms),
 command=EVAL, next: EVAL, inst: 0, qu: 0, qs: 1, aw: False, rs: ReadAsync, ws: Idle, in: 0, in-pipe: 0, out-pipe: 0,
 serverEndpoint: 192.168.86.36:6379, mc: 1/1/0, mgr: 10 of 10 available, clientName: mymachinename, 
IOCP: (Busy=0,Free=1000,Min=32,Max=1000), WORKER: (Busy=19,Free=32748,Min=50,Max=32767)

This error simply means that while the application was trying to read data, the request timed out for that brief period. It is not a permanent error, and your application should not fail the request immediately. You should have a retry (or similar) strategy in place that attempts the request again before concluding that the failure is permanent; otherwise the application will not behave correctly for its users.

Polly is an open source .NET library that provides out-of-the-box support for resilience strategies such as retry, circuit breaker, and fallback. In my experience, about 90% of transient failures can be handled with this library; only in special cases will you need to write your own handling code.
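For example, a circuit breaker policy can stop calls to a dependency that keeps failing and give it room to recover. Below is a minimal sketch, assuming the same RedisTimeoutException as in the log above; the threshold and break duration are illustrative, not recommendations.

// Open the circuit after 3 consecutive Redis timeouts and block further
// calls for 30 seconds before allowing a trial call through again.
private Polly.AsyncPolicy GetCacheCircuitBreakerPolicy()
{
    return Policy
        .Handle<RedisTimeoutException>()
        .CircuitBreakerAsync(
            exceptionsAllowedBeforeBreaking: 3,
            durationOfBreak: TimeSpan.FromSeconds(30),
            onBreak: (ex, breakDelay) =>
                Debug.WriteLine($"Circuit opened for {breakDelay.TotalSeconds}s: {ex.Message}"),
            onReset: () => Debug.WriteLine("Circuit closed, normal calls resume."));
}

While the circuit is open, calls fail fast instead of waiting on a dependency that is known to be struggling.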

The following code snippet shows how Polly can handle distributed cache timeout errors using a wait-and-retry strategy.

private Polly.AsyncPolicy GetAsyncCacheAccessRetryPolicy()
{
    var retries = 0;

    // Retry only on Redis timeouts, with an exponential back-off between attempts.
    var retryPolicy = Policy
        .Handle<RedisTimeoutException>()
        .WaitAndRetryAsync(
            ResilienceDefaults.DistributedCacheRetryCount,
            attempt => TimeSpan.FromMilliseconds(2 * Math.Pow(2, attempt) * 100),
            (ex, calculatedWaitDuration) =>
            {
                retries++;
                Debug.WriteLine(
                    $"Policy retry: Attempt({retries}): Waited({calculatedWaitDuration.TotalMilliseconds}ms): {ex.Message}");
            });

    return retryPolicy;
}

var retryPolicy = GetAsyncCacheAccessRetryPolicy();
return await retryPolicy.ExecuteAsync(async () =>
{
    var data = await _cache.StringGetAsync(GetRedisKey(key));

    // Nothing cached (or an empty value) for this key.
    if (data.IsNullOrEmpty)
    {
        return default;
    }

    var item = JsonConvert.DeserializeObject(data);
    return item;
});

The implementation allows a set number of retries, and the wait between attempts increases each time. An exponential back-off strategy works very well for timeout-related failures because it gives network congestion time to settle down between attempts.
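To make the schedule concrete, the sleep formula above (2 * 2^attempt * 100 ms) produces waits of 400 ms, 800 ms, 1.6 s, and 3.2 s for the first four attempts. A quick sketch to print that schedule (the retry count of 4 is illustrative):

// Prints the back-off delays produced by the retry policy's sleep formula.
for (var attempt = 1; attempt <= 4; attempt++)
{
    var wait = TimeSpan.FromMilliseconds(2 * Math.Pow(2, attempt) * 100);
    Console.WriteLine($"Attempt {attempt}: wait {wait.TotalMilliseconds} ms");
}
// Output: 400, 800, 1600, 3200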

This is just one example of handling a transient failure in an Azure environment. Your case could be different, so you might use a different strategy. In my experience, these Redis timeout failures usually resolve within 1 or 2 retries. If retries do not resolve the error within 4-5 attempts, you will need a fallback plan. For example, with a distributed cache you could temporarily switch to an in-memory cache and, once the network congestion clears, switch back to the Redis distributed cache. A sketch of this idea using Polly's fallback policy follows.
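In this sketch, GetFromLocalCache is a hypothetical in-memory lookup (not part of the original code), and the retry policy from earlier is wrapped inside Polly's fallback policy:

// Sketch only: GetFromLocalCache(key) is a hypothetical in-memory lookup,
// and GetAsyncCacheAccessRetryPolicy() is the retry policy shown earlier.
var fallbackPolicy = Policy<RedisValue>
    .Handle<RedisTimeoutException>()
    .FallbackAsync(cancellationToken => Task.FromResult(GetFromLocalCache(key)));

// Try Redis (with retries) first; if it still times out, serve the local copy.
var policyWrap = fallbackPolicy.WrapAsync(GetAsyncCacheAccessRetryPolicy());
var data = await policyWrap.ExecuteAsync(() => _cache.StringGetAsync(GetRedisKey(key)));

The wrap keeps the retry behavior for the common case while guaranteeing the caller still gets a value when Redis stays unreachable.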
