Published on

Azure Table Storage: Design Patterns for Query Optimization

Table of Contents

While Azure Table Storage does store data in a structured format, it fundamentally differs from traditional relational databases. Some of the design techniques we're accustomed to from our knowledge of relational databases might not be applicable here. That's why it's crucial to familiarize ourselves with the design patterns and techniques specific to Azure Table Storage.

If you're new to Azure Table Storage, I strongly recommend starting with the Introduction to Azure Table Storage and Getting Started with Azure Table Storage in .NET posts for a rock-solid foundation.

Types of Queries

Before proceeding, it's important to understand various types of queries we can execute against Azure Table Storage. This categorization is done on the basis of query's performance.

For the examples below, assume an entity Employee with the following structure.

PropertyType
PartitionKey (Department Name)string
RowKey (Employee Id)string
FirstNamestring
LastNamestring
EmailAddressstring

1. Point Query

Since Azure Table Storage is a key-value store, a Point Query, where we specify both PartitionKey and the RowKey is the most efficient one. When we require the lowest latency read operations then we should design the tables in such a way that we can execute point queries.

await tableClient.GetEntityAsync<Employee>("Development", "1");

2. Range Query

If we want to retrieve multiple entities from the table, we can use a Range Query. A Range Query is one where we specify the PartitionKey and a subset of RowKey values. For example,

tableClient.QueryAsync<Employee>(filter: $"PartitionKey eq 'Development' and RowKey > '5'");

3. Partition Scan

A Partition Scan takes place when we do not specify a RowKey. This type of query is relatively inefficient because it involves scanning the entire partition to find the result.

tableClient.QueryAsync<Employee>(e => e.PartitionKey == "Development" && e.FirstName == "John");

4. Table Scan

A Table Scan occurs when we do not specify a PartitionKey. It searches all of the partitions for matching entities, regardless of whether the filter includes a RowKey or not. In a benchmark that I performed a table scan turned out to be 1800 times slower than a point query. So it's better if we avoid it as it is highly inefficient.

tableClient.QueryAsync<Employee>(e => e.FirstName == "John" && e.LastName == "Hopper");

Now, let's look at some common design patterns that we use when working with Azure Table Storage.

Secondary Index Pattern

Table Storage creates an index using PartitionKey and RowKey values, which allows us to efficiently retrieve an entity using these values. However, if we want to retrieve an entity using another property, such as email, then we have to use the less efficient Partition Scan.

Problem

How can we create an index on a non-key property, making the retrieval efficient through a Point Query?

Solution

We can store multiple copies of each entity with different RowKey values.

By saving the entity this way, we get the ability to execute point queries by using an email or an employee id.

// Query using employee id
await tableClient.GetEntityAsync<Employee>("Development", "empid_1");

// Query using email
await tableClient.GetEntityAsync<Employee>("Development", "email_johnhopper@gmail.com");

Since Azure Storage is cheap, storing multiple copies of an entity should not be a concern. However, keeping copies in consistent state should be. For that, we can use Entity Group Transactions (EGTs). By using an EGT, we can update multiple entities within the same partition atomically.

Implementation

This is what a repository implementing a Secondary Index pattern would look like.

EmployeeRepository.cs
public class EmployeeRepository
{
    private readonly TableClient client;

    public EmployeeRepository()
    {
        client = new TableClient("<connection string>", "Employee");
    }

    public async Task Add(Employee employee)
    {
        var entities = new List<ITableEntity>()
        { 
            new EmployeeTableEntity(employee.Department, $"empid_{employee.Id}", employee),
            new EmployeeTableEntity(employee.Department, $"email_{employee.EmailAddress}", employee)
        };

        var batch = new List<TableTransactionAction>();
        batch.AddRange(entities.Select(e => new TableTransactionAction(TableTransactionActionType.Add, e)));
        await client.SubmitTransactionAsync(batch);
    }

    public Task<Employee?> GetById(string departmentName, Guid id)
    {
        return Get(departmentName, $"empid_{id}");
    }

    public Task<Employee?> GetByEmail(string departmentName, string email)
    {
        return Get(departmentName, $"email_{email}");
    }

    private async Task<Employee?> Get(string partitionKey, string rowKey)
    {
        var response = await client.GetEntityIfExistsAsync<EmployeeTableEntity>(partitionKey, rowKey);
        if (!response.HasValue)
        {
            return null;
        }

        var e = response.Value;
        return new Employee { Id = e.Id, City = e.City, Department = e.Department, FirstName = e.FirstName, LastName = e.LastName, EmailAddress = e.EmailAddress };
    }
}

Pay attention to how we're creating a batch and submitting it for a transaction; this is known as Entity Group Transactions (EGT). In a similar fashion, we can execute Update and Delete operations by specifying the TableTransactionActionType as TableTransactionActionType.UpdateMerge, TableTransactionActionType.UpdateReplace, or TableTransactionActionType.Delete, depending on our requirements.

Here's what supporting classes look like.

Employee.cs
public class Employee
{
    public Employee() { }

    public Employee(string firstName, string lastName, string emailAddress, string city, string department)
    {
        Id = Guid.NewGuid();
        FirstName = firstName;
        LastName = lastName;
        EmailAddress = emailAddress;
        City = city;
        Department = department;
    }

    public Guid Id { get; set; }

    public string FirstName { get; set; }

    public string LastName { get; set; }

    public string EmailAddress { get; set; }

    public string City { get; set; }

    public string Department { get; set; }
}
EmployeeTableEntity.cs
public class EmployeeTableEntity : ITableEntity
{
    public EmployeeTableEntity() { }

    public EmployeeTableEntity(string partitionKey, string rowKey, Employee employee)
    {
        PartitionKey = partitionKey;
        RowKey = rowKey;
        Id = employee.Id;
        FirstName = employee.FirstName;
        LastName = employee.LastName;
        EmailAddress = employee.EmailAddress;
        City = employee.City;
        Department = employee.Department;
    }

    public Guid Id { get; set; }

    public string FirstName { get; set; }

    public string LastName { get; set; }

    public string EmailAddress { get; set; }

    public string City { get; set; }

    public string Department { get; set; }

    public string PartitionKey { get; set; }

    public string RowKey { get; set; }

    public DateTimeOffset? Timestamp { get; set; }

    public ETag ETag { get; set; }
}

Index Entities Pattern

Suppose we have 10,000 employees in one partition and our goal is to find all employees living in the city of Palo Alto. If we were to execute the query below, then it would perform a Partition Scan to find the matching entities, even if only 5 employees were living in Palo Alto out of the 10,000.

tableClient.QueryAsync<EmployeeTableEntity>(e => e.PartitionKey == "Development" && e.City == "Palo Alto");

Problem

In such scenarios, when we need to search for entities based on a property that is neither key nor unique, How can we retrieve entities efficiently using Point Queries or Range Queries?

Solution

We'll create special entities called the Index Entities within the same partition. These entities will store an array of entity IDs that belong to a specific value of our target property.

Then, instead of directly searching for the entities using the property value within the partition, We'll retrieve the index entity for that value and use the stored IDs to directly fetch the entities.

Make sure, to keep the index entities in a consistent state using EGTs when inserting, updating, or deleting the entities.

Implementation

Below is the Index Entity class.

EmployeeIndexEntity.cs
public class EmployeeIndexEntity : ITableEntity
{
    public EmployeeIndexEntity() { }

    public EmployeeIndexEntity(string partitionKey, string rowKey)
    {
        PartitionKey = partitionKey;
        RowKey = rowKey;
    }

    public string EmployeeIds { get; set; }

    public string PartitionKey { get; set; }

    public string RowKey { get; set; }

    public DateTimeOffset? Timestamp { get; set; }

    public ETag ETag { get; set; }

    public void AddEmployeeId(string employeeId)
    {
        if (string.IsNullOrEmpty(EmployeeIds)) 
        {
            EmployeeIds = employeeId;
        }
        else if (!EmployeeIds.Contains(employeeId)) 
        {
            EmployeeIds += $",{employeeId}";
        }
    }
}

And here is the implementation of Repository class.

EmployeeRepository.cs
public class EmployeeRepository
{
    private readonly TableClient client;

    public EmployeeRepository()
    {
        client = new TableClient("<connection string>", "Employee");
    }

    public async Task Add(Employee employee)
    {
        var employeeEntity = new EmployeeTableEntity(employee.Department, employee.Id.ToString(), employee);

        var batch = new List<TableTransactionAction>();
        batch.Add(new TableTransactionAction(TableTransactionActionType.Add, employeeEntity));

        EmployeeIndexEntity indexEntity = null!;
        var response = await client.GetEntityIfExistsAsync<EmployeeIndexEntity>(employee.Department, employee.City);
        if (response.HasValue)
        {
            indexEntity = response.Value;
            indexEntity.AddEmployeeId(employee.Id.ToString());
            batch.Add(new TableTransactionAction(TableTransactionActionType.UpdateReplace, indexEntity));
        }
        else
        {
            indexEntity = new EmployeeIndexEntity(employee.Department, employee.City);
            indexEntity.AddEmployeeId(employee.Id.ToString());
            batch.Add(new TableTransactionAction(TableTransactionActionType.Add, indexEntity));
        }

        await client.SubmitTransactionAsync(batch);
    }

    public async Task<List<Employee>> GetByCity(string departmentName, string city)
    {
        var response = await client.GetEntityIfExistsAsync<EmployeeIndexEntity>(departmentName, city);
        if (!response.HasValue)
        {
            return new List<Employee>();
        }

        var tasks = new List<Task<Azure.Response<EmployeeTableEntity>>>();
        foreach (var employeeId in response.Value.EmployeeIds.Split(',')) 
        {
            tasks.Add(client.GetEntityAsync<EmployeeTableEntity>(departmentName, employeeId));
        }

        await Task.WhenAll(tasks);

        return tasks.Select(t => t.Result)
            .Where(r => r.HasValue)
            .Select(r => r.Value)
            .Select(e => new Employee { Id = e.Id, City = e.City, Department = e.Department, FirstName = e.FirstName, LastName = e.LastName})
            .ToList();
    }
}

In the Add() method, we are persisting both the EmployeeTableEntity and EmployeeIndexEntity simultaneously, using an EGT.

In this case, we've chosen to use the City as the RowKey since we're constructing an index based on it. However, if the requirement arises to index multiple properties, we can use a Compound key as the RowKey. For instance, we could use keys like "City_Palo Alto" or "LastName_Jones" to facilitate indexing based on both city and last name.

In the GetByCity() method, we first fetch the index entity for the given city and then retrieve all associated entities using the IDs stored in that index entity. Since Table Storage lacks support for an in clause, we cannot retrieve multiple entities in a single query. Therefore, we perform parallel loading of these entities to optimize the process.

Conclusion

In conclusion, Azure Table Storage offers a unique approach to storing structured data that differs significantly from traditional relational databases. To harness its full potential, it's crucial to understand and implement design patterns specific to Azure Table Storage. These design patterns provide solutions for common data retrieval problems when working with Azure Table Storage.

What's Next