Making Microservices Fault Tolerant with Code Examples

In the world of microservices, where software systems are divided into small, independent components, fault tolerance becomes an essential factor for success. But what exactly is fault tolerance, and why does it matter?

Fault tolerance is the ability of a system to continue operating, even if some of its components fail or are interrupted.

A component could be anything: a database, a message broker, storage, or even another service. Any of these can fail, but the system should handle such failures gracefully and stop the ripple effect as close to the source as possible.

In the context of microservices, fault tolerance becomes even more important because the system is made up of many more components in the form of individual services. So we need to design and implement mechanisms that ensure individual services can handle unexpected errors or failures without causing a complete system breakdown.

Fault Tolerance in Microservices

The most common breaking point in microservices is inter-service communication. When multiple services interact with each other to fulfill a specific task, errors can occur during this communication process. To make microservices fault tolerant, it is crucial to address these potential errors and ensure a reliable communication mechanism.

In microservices architecture, there are two types of communication:

  • Synchronous
  • Asynchronous

Asynchronous communication is largely fault tolerant by design because it leverages intermediaries such as message queues. These intermediaries decouple services and provide a buffer that can absorb intermittent failures: if a consumer is briefly unavailable, messages simply wait in the queue until it recovers. Consequently, asynchronous communication naturally lends itself to fault tolerance.
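
To make the buffering idea concrete, here is a minimal, in-process sketch that uses System.Threading.Channels as a stand-in for a real message broker such as RabbitMQ or Azure Service Bus; the OrderPlaced type and the processing logic are illustrative assumptions, not part of any specific service in this post.

using System;
using System.Threading.Channels;
using System.Threading.Tasks;

public record OrderPlaced(string OrderId);

public class OrderQueue
{
    // The channel is the buffer that decouples producer and consumer
    private readonly Channel<OrderPlaced> _channel = Channel.CreateUnbounded<OrderPlaced>();

    // The producer enqueues the message and returns immediately,
    // even if the consumer is currently slow or unavailable.
    public ValueTask PublishAsync(OrderPlaced message) =>
        _channel.Writer.WriteAsync(message);

    // The consumer drains the buffer at its own pace; messages that have not
    // been processed yet simply wait in the buffer.
    public async Task ConsumeAsync()
    {
        await foreach (var order in _channel.Reader.ReadAllAsync())
        {
            Console.WriteLine($"Processing order {order.OrderId}...");
        }
    }
}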

While infrastructure components rarely go down outright thanks to reliable cloud-based resources, those failures still need to be addressed and handled appropriately; this post, however, will primarily focus on communication-related failures.

On the other hand, synchronous communication requires immediate responses and can be more prone to failures. However, several patterns and techniques can be employed to make synchronous communication fault-tolerant:

  1. Timeout
  2. Retry
  3. Exponential Backoff
  4. Circuit Breaker
  5. Rate Limiting
  6. Caching

Timeouts

Timeouts define the maximum duration for a service to receive a response before considering it a failure.

By setting timeouts, microservices can prevent long waiting periods, and gracefully handle scenarios where a service may be experiencing issues or delays. Timeouts provide a safety net, enabling services to recover quickly from potential failures and maintain system stability.

Consider the following example, where the Catalog microservice calls the Cart microservice to add a product to the cart.

CartServiceClient.cs
public async Task AddProductToCart(string cartServiceUrl, string productId)
{
    // Fail the call if the Cart service does not respond within 5 seconds
    var timeoutPolicy = Policy.TimeoutAsync(TimeSpan.FromSeconds(5), TimeoutStrategy.Pessimistic);

    await timeoutPolicy.ExecuteAsync(async () =>
    {
        var requestUrl = $"{cartServiceUrl}/cart/add?productId={productId}";
        var response = await httpClient.PostAsync(requestUrl, null);
        response.EnsureSuccessStatusCode();
    });
}

In the above example, we used Polly, a resilience library for C#, to define a timeout policy that sets a maximum duration of 5 seconds for the HTTP request to complete. The ExecuteAsync method then executes the request; if it exceeds the specified timeout, a TimeoutRejectedException is thrown, which can be caught and handled appropriately.

By using Polly's timeout policy, we ensure that the Catalog microservice's request to add a product to the cart is bounded by a specific timeout duration, preventing indefinite waits and handling slow responses gracefully.
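
As a brief usage sketch, the caller can catch the TimeoutRejectedException and react to it instead of waiting indefinitely; the cartServiceClient instance and the log message below are illustrative assumptions.

try
{
    // TimeoutRejectedException comes from the Polly.Timeout namespace
    await cartServiceClient.AddProductToCart(cartServiceUrl, productId);
}
catch (TimeoutRejectedException)
{
    // The Cart service did not respond within 5 seconds; fail fast with a clear message
    Console.WriteLine("Adding the product to the cart timed out. Please try again later.");
}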

Retry

Retry, as the name suggests, means attempting a failed operation again. Sometimes communication fails because of transient errors (errors that resolve themselves) in the services involved. Here are a few examples of transient errors:

  • Temporary network disruption due to network congestion.
  • Service unavailable due to restart.
  • Resource exhaustion such as database reaching its connection limit.

In all of the above cases, if we simply retry the operation after a short delay, it will most probably succeed, because the underlying error is likely to have resolved itself by then.

Retrying a failed operation allows the system to recover from temporary issues and increase the chances of a successful request.

CartServiceClient.cs
public async Task AddProductToCart(string cartServiceUrl, string productId)
{
    var timeoutPolicy = Policy.TimeoutAsync(TimeSpan.FromSeconds(5), TimeoutStrategy.Pessimistic);

    var retryPolicy = Policy
        .Handle<HttpRequestException>()
        .Or<TimeoutRejectedException>()
        .WaitAndRetryAsync(
            retryCount: 3, // Number of retries
            sleepDurationProvider: retryAttempt => TimeSpan.FromSeconds(10), // Wait 10 seconds between retries
            onRetry: (exception, timeSpan, retryCount, context) =>
            {
                Console.WriteLine($"Retry {retryCount} due to {exception}. Waiting for {timeSpan} before retrying...");
            }
        );

    var policyWrap = Policy.WrapAsync(
        retryPolicy,   // Outer policy: retries each failed or timed-out attempt
        timeoutPolicy  // Inner policy: bounds each individual attempt to 5 seconds
        // Add more policies here if needed (e.g., circuit breakers)
    );

    await policyWrap.ExecuteAsync(async () =>
    {
        var requestUrl = $"{cartServiceUrl}/cart/add?productId={productId}";
        var response = await httpClient.PostAsync(requestUrl, null);
        response.EnsureSuccessStatusCode();
    });
}

In the updated code example, we introduce a retryPolicy using Polly's WaitAndRetryAsync method. This policy is defined to handle HttpRequestException (indicating a general network failure) and TimeoutRejectedException (indicating a timeout scenario) as the retryable exceptions.

The WaitAndRetryAsync method specifies the number of retries (in this case, 3) and the sleep duration between retries (in this case, 10 seconds).

Additionally, an onRetry delegate is provided to log the retry attempts, including the exception that triggered the retry, the time to wait before the next retry, the current retry count, and any contextual information.

The retryPolicy is then combined with the existing timeoutPolicy in the policyWrap. It is listed first so that it becomes the outer policy: each individual attempt is bounded by the 5-second timeout, and a timed-out attempt can then be retried. Multiple policies can be combined within the WrapAsync method to provide a comprehensive fault tolerance mechanism.

By implementing retries using Polly in this manner, the example ensures that the AddProductToCart method will automatically retry the HTTP request in case of failures, such as network issues or timeouts. This improves the resilience of the microservice by allowing it to recover from transient failures and increases the likelihood of a successful request.

Exponential Backoff

Exponential Backoff is an extension of the Retry strategy. It introduces a delay before retrying the operation, and the delay increases with each subsequent retry.

The purpose of exponential backoff is to stop the system from getting overwhelmed with too many retries, which could make the problem worse. By introducing increasing delays between retries, the technique allows for a more gradual recovery and provides a higher chance of success once the underlying issue is resolved.

We can update the above code example to introduce exponential backoff in the following way.

var retryPolicy = Policy
    .Handle<HttpRequestException>()
    .Or<TimeoutRejectedException>()
    .WaitAndRetryAsync(
        retryCount: 3, // Number of retries
        sleepDurationProvider: retryAttempt => TimeSpan.FromSeconds(Math.Pow(2, retryAttempt)), // Exponential backoff strategy
        onRetry: (exception, timeSpan, retryCount, context) =>
        {
            Console.WriteLine($"Retry {retryCount} due to {exception}. Waiting for {timeSpan} before retrying...");
        }
    );

In the above code example, the delay grows exponentially with each retry attempt: 2 seconds before the first retry, then 4, then 8.
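
One common refinement, not shown in the original example, is to add a small random jitter to each delay so that many clients retrying at the same time do not all hit the recovering service at the same instant. A minimal sketch using the same WaitAndRetryAsync method:

var jitter = new Random();

var retryPolicy = Policy
    .Handle<HttpRequestException>()
    .Or<TimeoutRejectedException>()
    .WaitAndRetryAsync(
        retryCount: 3,
        // Exponential backoff (2, 4, 8 seconds) plus up to one second of random jitter
        sleepDurationProvider: retryAttempt =>
            TimeSpan.FromSeconds(Math.Pow(2, retryAttempt))
            + TimeSpan.FromMilliseconds(jitter.Next(0, 1000)));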

Circuit Breaker

When a certain number of failures is reached, the circuit breaker strategy stops subsequent requests from being made for a certain period of time, reducing the load on the failing service and allowing it to recover. It is similar to Exponential Backoff in the sense that it gives the failing service room to recover. The difference is that while the circuit is open, no new requests to the failing service are made or retried at all; once the break duration has elapsed, a trial request is allowed through, and the circuit closes again if it succeeds.

The example below demonstrates how we can add the Circuit Breaker strategy.

using System;
using System.Net.Http;
using Polly;
using Polly.CircuitBreaker;
using Polly.Timeout;
using System.Threading.Tasks;

public class CartService
{
    private static readonly HttpClient httpClient = new HttpClient();

    private static readonly AsyncCircuitBreakerPolicy circuitBreakerPolicy = Policy
        .Handle<HttpRequestException>()
        .Or<TimeoutRejectedException>()
        .CircuitBreakerAsync(
            exceptionsAllowedBeforeBreaking: 3, // Number of failures before circuit breaker trips
            durationOfBreak: TimeSpan.FromSeconds(30), // Duration of the break after tripping
            onBreak: (exception, timespan) =>
            {
                Console.WriteLine($"Circuit breaker tripped due to {exception}. The circuit will remain open for {timespan.TotalSeconds} seconds.");
            },
            onReset: () =>
            {
                Console.WriteLine("Circuit breaker reset. The circuit is now closed.");
            }
        );

    public async Task AddProductToCart(string cartServiceUrl, string productId)
    {
        var timeoutPolicy = Policy.TimeoutAsync(TimeSpan.FromSeconds(5), TimeoutStrategy.Pessimistic);

        var retryPolicy = Policy
            .Handle<HttpRequestException>()
            .Or<TimeoutRejectedException>()
            .WaitAndRetryAsync(
                retryCount: 3, // Number of retries
                sleepDurationProvider: retryAttempt => TimeSpan.FromSeconds(Math.Pow(2, retryAttempt)), // Exponential backoff strategy
                onRetry: (exception, timeSpan, retryCount, context) =>
                {
                    Console.WriteLine($"Retry {retryCount} due to {exception}. Waiting for {timeSpan} before retrying...");
                }
            );

        var policyWrap = Policy.WrapAsync(
            retryPolicy,          // Outer: retries failed or timed-out attempts
            circuitBreakerPolicy, // Middle: stops calls after repeated failures
            timeoutPolicy         // Inner: bounds each individual attempt to 5 seconds
        );

        await policyWrap.ExecuteAsync(async () =>
        {
            var requestUrl = $"{cartServiceUrl}/cart/add?productId={productId}";
            var response = await httpClient.PostAsync(requestUrl, null);
            response.EnsureSuccessStatusCode();
        });
    }
}

In the updated code example, we introduce a circuitBreakerPolicy using Polly's CircuitBreakerAsync method. This policy is configured to handle HttpRequestException and TimeoutRejectedException as the considered exceptions for tripping the circuit breaker.

The exceptionsAllowedBeforeBreaking parameter specifies the number of consecutive failures required to trip the circuit breaker. Once the threshold is reached, subsequent requests will not be attempted for a specified durationOfBreak, which is set to 30 seconds in this example.

The onBreak delegate is invoked when the circuit breaker trips, providing information about the exception that triggered the trip and the duration of the break. Similarly, the onReset delegate is called when the circuit breaker is reset and the circuit is closed again.

The circuitBreakerPolicy is then combined with the existing timeoutPolicy and retryPolicy in the policyWrap. The policies are ordered so that the retry policy is outermost, the circuit breaker sits in the middle, and the per-attempt timeout is innermost; multiple policies can be combined within the WrapAsync method in this way to provide a comprehensive fault tolerance mechanism.
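
While the circuit is open, Polly rejects calls immediately with a BrokenCircuitException (from the Polly.CircuitBreaker namespace), so callers can handle that case explicitly. A minimal usage sketch follows; the cartService instance and the fallback message are illustrative assumptions.

try
{
    await cartService.AddProductToCart(cartServiceUrl, productId);
}
catch (BrokenCircuitException)
{
    // The circuit is open: the Cart service is considered unhealthy,
    // so fail fast instead of sending it more requests.
    Console.WriteLine("Cart service is temporarily unavailable. Please try again shortly.");
}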

Rate Limiting

Rate limiting is a technique used to control and limit the number of requests or operations allowed within a certain timeframe. In the context of fault tolerance in microservices, rate limiting helps prevent overwhelming a service or resource with an excessive number of requests, which can lead to performance degradation or service disruption.

Implementing rate limiting ensures that the system operates within defined capacity limits and handles incoming requests in a controlled manner. It helps protect against abusive or malicious usage, manages resource utilization, and improves overall system stability and reliability.

using System;
using System.Net.Http;
using System.Threading.RateLimiting;
using System.Threading.Tasks;
using Microsoft.AspNetCore.Builder;
using Microsoft.AspNetCore.Http;
using Microsoft.AspNetCore.RateLimiting;
using Microsoft.Extensions.DependencyInjection;

var builder = WebApplication.CreateBuilder(args);

builder.Services.AddHttpClient();
builder.Services.AddRateLimiter(options =>
{
    // Allow at most 10 requests per minute under this policy
    options.AddFixedWindowLimiter("MyRateLimitPolicy", limiterOptions =>
    {
        limiterOptions.PermitLimit = 10;
        limiterOptions.Window = TimeSpan.FromMinutes(1);
    });

    // Requests rejected by the limiter receive a 429 Too Many Requests response
    options.RejectionStatusCode = StatusCodes.Status429TooManyRequests;
});

var app = builder.Build();

app.UseRateLimiter();

// Apply the rate limiting policy to this specific endpoint
app.MapPost("/cart/add", HandleAddToCartRequest)
   .RequireRateLimiting("MyRateLimitPolicy");

app.Run();

async Task HandleAddToCartRequest(HttpContext context)
{
    var httpClient = context.RequestServices.GetRequiredService<IHttpClientFactory>().CreateClient();
    var response = await httpClient.PostAsync("https://your-cart-service-url.com/cart/add", null);
    response.EnsureSuccessStatusCode();
}

In this code example, we leverage the rate limiting middleware introduced in ASP.NET Core 7 (the Microsoft.AspNetCore.RateLimiting middleware, which ships as part of the framework).

The AddRateLimiter method is used to configure rate limiting policies within the service collection. Here, we define a fixed window policy named "MyRateLimitPolicy" that allows 10 requests per minute. The UseRateLimiter middleware is then added to the pipeline, and RequireRateLimiting applies the policy to the /cart/add endpoint.

In the HandleAddToCartRequest method, we retrieve an instance of IHttpClientFactory (registered via AddHttpClient) from the request services to create an HttpClient and make the POST request to the cart service's endpoint.

Please note that the code provided assumes you have the necessary configuration and routing set up in your application. Adjust the endpoint URL and any other configurations as per your specific scenario.

By utilizing the rate limiting feature in .NET 7, you can easily configure and apply rate limiting policies to specific endpoints or globally, providing efficient control over the rate of incoming requests and enhancing fault tolerance in your microservices architecture.
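
On the calling side, a service such as the Catalog microservice could treat the resulting 429 (Too Many Requests) response as a retryable condition using the same Polly patterns shown earlier. The sketch below is only an illustration and reuses the httpClient, cartServiceUrl, and productId names from the previous examples.

// Treat HTTP 429 responses from the Cart service as retryable, with exponential backoff
var rateLimitRetryPolicy = Policy
    .HandleResult<HttpResponseMessage>(r => (int)r.StatusCode == 429)
    .WaitAndRetryAsync(
        retryCount: 3,
        sleepDurationProvider: retryAttempt => TimeSpan.FromSeconds(Math.Pow(2, retryAttempt)));

var response = await rateLimitRetryPolicy.ExecuteAsync(() =>
    httpClient.PostAsync($"{cartServiceUrl}/cart/add?productId={productId}", null));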

Caching

Caching is a technique used in microservices architectures to improve performance, reduce latency, and enhance fault tolerance. It involves storing frequently accessed or computed data in a cache, which is a temporary storage location that allows subsequent requests for the same data to be served quickly without the need to recompute or retrieve the data from the original source.

In the context of fault tolerance, caching can play a crucial role in mitigating the impact of failures or performance issues. By caching data, microservices can reduce their reliance on external dependencies or downstream services, thereby reducing the chances of failures due to network issues or service unavailability. Caching helps to provide a fallback mechanism when the original data source is not accessible, improving the overall resilience of the system.

To illustrate how caching can be implemented in C#, let's consider a code example using the built-in caching features in .NET:

using Microsoft.Extensions.Caching.Memory;
using System;

public class ProductService
{
    private readonly IMemoryCache _cache;

    public ProductService(IMemoryCache cache)
    {
        _cache = cache;
    }

    public Product GetProductDetails(string productId)
    {
        var cacheKey = $"Product_{productId}";
        if (!_cache.TryGetValue(cacheKey, out Product product))
        {
            // Data not found in cache, retrieve from the original source
            product = GetProductFromDatabase(productId);

            // Store the retrieved data in the cache
            var cacheEntryOptions = new MemoryCacheEntryOptions()
                .SetSlidingExpiration(TimeSpan.FromMinutes(10)); // Cache entry expires after 10 minutes
            _cache.Set(cacheKey, product, cacheEntryOptions);
        }

        return product;
    }

    private Product GetProductFromDatabase(string productId)
    {
        // Code to fetch the product details from the database.
        // Replace this with the actual implementation, for example:
        // return _dbContext.Products.FirstOrDefault(p => p.Id == productId);
        throw new NotImplementedException("Fetch the product from the database here.");
    }
}

In the provided code example, we utilize the IMemoryCache interface from the Microsoft.Extensions.Caching.Memory namespace, which is part of the .NET ecosystem. The IMemoryCache interface provides a simple in-memory cache implementation.

The ProductService class takes an IMemoryCache instance as a dependency in its constructor. The GetProductDetails method first checks if the desired product is available in the cache by using a cache key specific to the product. If the data is found in the cache, it is retrieved and returned directly.

If the data is not found in the cache, it is fetched from the original data source (in this case, a database) using the GetProductFromDatabase method. The retrieved data is then stored in the cache with a 10-minute sliding expiration using MemoryCacheEntryOptions, meaning the entry expires 10 minutes after it was last accessed.

By implementing caching in this manner, microservices can reduce the reliance on external services and improve fault tolerance. Frequent requests for the same data can be served from the cache, reducing the load on the original source and providing a fallback option in case of unavailability.
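
The example above always consults the cache first and only goes to the database on a miss. To use the cache as an explicit fallback when the database is unreachable, as described earlier, one option is to keep a longer-lived copy of each product and serve it when the lookup fails. The sketch below only illustrates that idea; the method and key names are assumptions, not part of the original example.

public Product GetProductDetailsWithFallback(string productId)
{
    var staleKey = $"Product_Stale_{productId}";
    try
    {
        var product = GetProductFromDatabase(productId);

        // Keep a longer-lived copy that can be served if the database becomes unavailable
        _cache.Set(staleKey, product, new MemoryCacheEntryOptions()
            .SetAbsoluteExpiration(TimeSpan.FromHours(1)));

        return product;
    }
    catch (Exception)
    {
        // Database is unreachable: fall back to a possibly stale cached copy, if one exists
        if (_cache.TryGetValue(staleKey, out Product staleProduct))
        {
            return staleProduct;
        }
        throw;
    }
}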

It's important to note that caching strategies should consider data consistency and cache invalidation mechanisms to ensure that the cached data remains up-to-date and reflects the latest changes. Additionally, caching should be applied judiciously, considering the nature of the data and its expiration requirements, to strike a balance between performance and data freshness.
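
For simple invalidation, IMemoryCache provides a Remove method. For example, a hypothetical UpdateProduct method (not part of the original example; SaveProductToDatabase and the Id property are placeholders) could evict the entry after writing to the database so that the next read repopulates the cache with fresh data:

public void UpdateProduct(Product product)
{
    // Persist the change first (placeholder for the real database update)
    SaveProductToDatabase(product);

    // Evict the cached copy so the next GetProductDetails call fetches the latest data
    _cache.Remove($"Product_{product.Id}");
}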

Conclusion

You can use all of these patterns or a combination of them. Each of them mitigates a specific problem; none of them is a silver bullet for all failures.

By employing these patterns, synchronous communication can be made more fault-tolerant. Each pattern addresses different aspects of potential failures, such as unresponsiveness, temporary issues, and excessive load. Applying these techniques enhances the resilience of microservices and improves the overall fault tolerance of the system.

Remember, fault tolerance in microservices is not about completely eliminating failures but rather mitigating their impact and ensuring the system can handle them gracefully. These patterns provide strategies to handle common failure scenarios in synchronous communication and contribute to building a robust and reliable microservices architecture.