Nice nginx features for operators

June 27, 2021

In the previous post, I shared a few things that were useful to me as a developer.

Now wearing my “ops” hat, there are a few things that I wanted to cover - blocking bad clients, rate limiting, caching, and gradual rollout.

Blocking bad clients

Blocking bad clients in nginx is usually implemented with a simple return 403 for matching requests. To classify requests we can use any builtin variable, e.g. $http_user_agent to match by user agent:

server {
    # ...

    # Block all bots
    if ($http_user_agent ~ ".*bot.*") {
        return 403;
    }

    # ...
}

If you need more conditions to identify bad clients, use the map directive to construct the final variable like this:

http {
    # Ban bots using specific API key
    map $http_user_agent:$arg_key $ban {
        ~.*bot.*:1234567890 1;
        default 0;
    }

    server {
    # ...

        if ($ban = 1) {
            return 403;
        }

    # ...
    }
}

Simple and easy. Now, let’s see more involved cases where we need to rate limit some clients.

Rate limiting

Rate limiting allows you to throttle requests by some pattern. In nginx it is configured with two directives:

  1. limit_req_zone, which describes the “zone”. A zone contains the configuration on how to classify requests for rate limiting and the actual limits.
  2. limit_req, which applies a zone to a particular context - http for global limits, server for a particular virtual server, and location for a particular location within a virtual server.

To illustrate this, let’s say we need to implement the following rate limiting configuration:

  • Global rate limit of 100 RPS by IP
  • Limit search engine crawlers to 1 RPM. Crawlers are determined by the User-Agent header.
  • Limit requests from some bad client by API token to 1 RPS.

To classify requests you need to provide a key to limit_req_zone. The key is usually some variable, either predefined by nginx or constructed by you via map. All requests that share the same key value will be tracked together in the zone's shared memory for rate limiting.

To set up the global rate limit by IP, we need to provide the IP as a key in limit_req_zone. Looking at varindex for predefined variables you can see $binary_remote_addr that we will use like this:

http {
    # ...
    limit_req_zone $binary_remote_addr zone=global:100m rate=100r/s;
    # ...
}

Heads up: if your nginx is not public, i.e. it’s behind another proxy, the remote address will be incorrectly attributed to the proxy in front of your nginx. Use the set_real_ip_from and real_ip_header directives to extract the remote address of the real client from request headers.
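As a minimal sketch, assuming the upstream proxy lives in 10.0.0.0/8 and passes the client address in the X-Real-IP header (adjust both to your setup):

```nginx
http {
    # Trust the X-Real-IP header only from our own proxy network
    set_real_ip_from 10.0.0.0/8;
    real_ip_header X-Real-IP;
    # ...
}
```

After this, $remote_addr (and $binary_remote_addr used in the rate limit zone) will hold the real client address instead of the proxy's.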

Now, to limit search engine crawlers by User-Agent header we have to use map:

http {
    # ...
    map $http_user_agent $crawler {
        ~*.*(bot|spider|slurp).* $http_user_agent;
        default "";
    }

    limit_req_zone $crawler zone=crawlers:1M rate=1r/m;
    # ...
}

Here we are setting $crawler variable as a limit_req_zone key. The key in limit_req_zone must have distinct values for different clients to correctly attribute request counters. We store the real user agent value for keys, so all requests with a particular user agent will be accounted as a single stream regardless of other properties like IP address. If the request is not from a crawler we use an empty string which disables rate limiting.

Finally, to limit requests by API token we use map to create a key variable for another rate limit zone:

http {
    # ...
    map $http_authorization $badclients {
        ~.*6d96270004515a0486bb7f76196a72b40c55a47f.* 6d96270004515a0486bb7f76196a72b40c55a47f;
        ~.*956f7fd1ae68fecb2b32186415a49c316f769d75.* 956f7fd1ae68fecb2b32186415a49c316f769d75;
        default "";
    }
    # ...
    limit_req_zone $badclients zone=badclients:1M rate=1r/s;
}

Here we look into the Authorization header for an API token like Authorization: Bearer 1234567890. If the header matches one of the known tokens, we use that token as the value of the $badclients variable and then use it as a key for limit_req_zone.

Now that we have configured 3 rate limit zones, we can apply them where needed. Here is the full config:

http {
    # ...
    # Global rate limit per IP.
    # Used when child context doesn't provide rate limiting configuration.
    limit_req_zone $binary_remote_addr zone=global:100m rate=100r/s;
    limit_req zone=global;
    # ...

    # Rate limit zone for crawlers
    map $http_user_agent $crawler {
        ~*.*(bot|spider|slurp).* $http_user_agent;
        default "";
    }
    limit_req_zone $crawler zone=crawlers:1M rate=1r/m;

    # Rate limit zone for bad clients
    map $http_authorization $badclients {
        ~.*6d96270004515a0486bb7f76196a72b40c55a47f.* 6d96270004515a0486bb7f76196a72b40c55a47f;
        ~.*956f7fd1ae68fecb2b32186415a49c316f769d75.* 956f7fd1ae68fecb2b32186415a49c316f769d75;
        default "";
    }
    limit_req_zone $badclients zone=badclients:1M rate=1r/s;

    server {
        listen 80;
        server_name www.example.com;
        # ...
        limit_req zone=crawlers; # Apply to all locations within www.example.com
        limit_req zone=global; # Fallback
        # ...
    }

    server {
        listen 80;
        server_name api.example.com;
        # ...
        location /heavy/method {
            # ...
            limit_req zone=badclients; # Apply to a single location serving some heavy method
            limit_req zone=global; # Fallback
            # ...
        }
        # ...
    }

}

Note that we had to add the global zone as a fallback wherever we have other limit_req configurations. That’s needed because nginx falls back to limit_req defined in the parent context only if the current context doesn’t have any limit_req configuration.

So the general pattern for configuring rate limiting is the following:

  • Prepare a variable that will store the key for rate limiting. The keys must be distinct for different rate limiting buckets.
  • An empty key disables rate limiting.
  • Use that variable as the key in the limit_req_zone configuration.
  • Apply the rate limit zone where needed with limit_req.
  • If you need a fallback configuration, define it together with the configuration on the current level.
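The steps above can be condensed into one sketch (the names $limit_key and example, and the zone size and rate, are placeholders, not recommendations):

```nginx
http {
    # 1. Prepare a key variable; the empty default disables limiting
    map $http_user_agent $limit_key {
        ~*curl $http_user_agent;
        default "";
    }

    # 2. Describe the zone using that key
    limit_req_zone $limit_key zone=example:1m rate=10r/s;

    server {
        # 3. Apply the zone (together with a fallback, if any, on this level)
        limit_req zone=example;
    }
}
```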

Rate limiting will help keep your system stable. Now let’s talk about caching, which can take excessive load off the backends.

Caching

One of the greatest features of nginx is its ability to cache responses.

Let’s say we are proxying requests to some backend that returns static data that is expensive to compute. We can shave the load from that backend by caching its response.

Here is how it’s done:

http {
    # ...
    proxy_cache_path  /var/cache/nginx/billing keys_zone=billing:500m max_size=1000m inactive=1d;
    # ...

    server {
        # ...
        location /billing {
            proxy_pass http://billing_backend/;

            # Apply the billing cache zone
            proxy_cache billing;

            # Override default cache key. Include `Customer-Token` header to distinguish cache values per customer
            proxy_cache_key "$scheme$proxy_host$request_uri $http_customer_token";

            proxy_cache_valid 200 302 1d;
            proxy_cache_valid 404 400 10m;
        }
    }
}

In this example, we cache responses from the “billing” service that returns billing information for a client. Imagine that these requests are heavy so we cache them per customer. We assume that clients access our billing API with the same URL but provide a Customer-Token HTTP header to distinguish themselves.

First, caching needs a place to store the values. This is configured with the proxy_cache_path directive. It needs at least 2 required params - the path and keys_zone. The keys_zone gives a name to the cache and sets the size of the hash table that tracks cache keys. The path will hold the actual files, named after the MD5 hash of the cache key which, by default, is the full URL of the request. But you can, of course, configure your own cache key with the proxy_cache_key directive where you can use any variables including HTTP headers and cookies.

In our case, we have overridden the default cache key by adding the $http_customer_token variable holding the value of the Customer-Token HTTP header. This way we will not poison the cache between customers.

Then, as with rate limits, you have to apply the configured cache zone to the server, location, or globally using proxy_cache directive. In my example, I’ve applied caching for a single location.

Another important thing to configure from the start is cache invalidation. By default, only responses with 200, 301, and 302 HTTP codes are cached, and values older than 10 minutes will be deleted.

Finally, when proxying requests to upstreams, nginx respects some headers like Cache-Control. If that header contains something like no-store, must-revalidate then nginx will not cache the response. To override this behavior add proxy_ignore_headers "Cache-Control";.
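Applied to the billing location from above, that would look something like this sketch:

```nginx
location /billing {
    proxy_pass http://billing_backend/;
    proxy_cache billing;

    # Cache responses even when the backend sends
    # Cache-Control: no-store or must-revalidate
    proxy_ignore_headers "Cache-Control";
}
```

Be careful with this one: it silently overrides whatever caching policy the backend declares, so make sure the backend team knows.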

So to configure nginx cache invalidation do the following:

  • Set the max_size in proxy_cache_path to bound the amount of disk space the cache will occupy. If nginx needs to cache more than max_size, it will evict the least recently used values from the cache.
  • Set the inactive param in proxy_cache_path to remove cache items that haven’t been accessed for the given time, regardless of their freshness.
  • Finally, add the proxy_cache_valid directive that sets the TTL for cache items in a given location or server.

In my example, I’ve configured caching of 200 and 302 responses for a day. And for error responses I’ve added 10 minutes of caching to avoid thrashing the backend in vain.

Gradual rollout of a new service

Another feature that is rarely used, but when it’s needed it’s a godsend, is a gradual rollout.

Imagine you are doing a massive rewrite of your product. Maybe you’re migrating to a new database system, rewriting backend in Go, or moving to a cloud. Whatever.

Your current version is used by all of the clients and you have deployed the new version alongside it. How would you switch clients from the current backend to the new one? The obvious choice is to just flip the switch and hope everything will work. But hope is not a good strategy.

You could have tested your new version rigorously. You might even do traffic mirroring to ensure that the new system operates correctly. But in my experience there is always something that goes wrong - a forgotten important header in the response, a slightly changed format, a rare request that swamps your DB.

I’m sure that it’s better to gradually roll out massive changes. Even a few days helps a lot. Sure, it requires more work but it pays off.

The main feature in nginx that provides gradual rollout is the split_clients module. It works like map but instead of setting the variable by some pattern, it sets the variable based on a distribution of the hashed source values. Let me illustrate it:

http {
    upstream current {
        server backend1;
        server backend2;
    }

    upstream new {
        server newone.team.svc max_fails=0;
    }

    split_clients $arg_key $destination {
        5% new;
        *  current;
    }

    server {
        # ...
        location /api {
            proxy_pass http://$destination/;
        }
    }
}

This split_clients configuration does the following - it looks into the key query argument and for 5% of the values it sets $destination to new. For the other 95% of keys, it will set $destination to current. The way it works is that the source variable is hashed into a 32-bit hash that produces values from 0 to 4294967295, and the X percent is simply the first 4294967296 * X / 100 values (for 5% it’s the first 4294967296 * 5 / 100 = 214748364 values).

Just to give you a sense of how the 5% example above behaves, here is what distribution looks like

key | $destination
----+-------------
1   |   current
2   |   current
3   |   current
4   |   current
5   |   current
6   |   current
7   |   current
8   |   new
9   |   current
10  |   new

Since split_clients creates a variable you can use it in our beloved map to construct more complex examples like this:

http {
    upstream current {
        server backend1;
        server backend2;
    }

    upstream new {
        server newone.team.svc max_fails=0;
    }

    split_clients $arg_key $new_api {
        5% 1;
        *  0;
    }

    map $new_api:$cookie_app_switch $destination {
        ~.*:1 new;
        ~0:.* current;
        ~1:.* new;
    }

    server {
        # ...
        location /api {
            proxy_pass http://$destination/;
        }
    }
}

In this example, we are combining the value from the split_clients distribution with the value of the app_switch cookie. If the cookie is set to 1, we set $destination to new upstream. Otherwise, we look into the value from split_clients. This is a kind of feature flag to test the new system in production - everyone with the cookie set will always get responses from the new upstream.

The distribution of the keys is consistent. If you’ve used an API key for split_clients then a user with the same API key will always be placed into the same group.

With this configuration, you can divert traffic to the new system starting with some small percentage and gradually increase it. The little downside here is that you have to change the percentage value in the config and reload nginx with nginx -s reload to apply it - there is no builtin API for that.

Now, let’s talk about nginx logging.

Structured logs

Collecting logs from nginx is a great idea because it’s usually the entry point for clients' traffic, so it can report the actual service experience as customers see it.

To get any value from logs they should be collected in some central place like the Elastic stack or Splunk where you can easily query them and even build decent analytics. These log management tools require structured data, but nginx by default logs in the so-called “combined” log format which is an unstructured mess that is expensive to parse.

The solution to this is simple - configure structured logging for nginx. We can do this with the log_format directive. I always log in JSON format because it’s understood universally. Here is how to configure JSON logging for nginx:

http {
    # ...
    log_format json escape=json '{'
        '"server_name": "billing-proxy",'
        '"ts":"$time_iso8601",'
        '"remote_addr":"$remote_addr","host":"$host","origin":"$http_origin","url":"$request_uri",'
        '"request_id":"$request_id","upstream":"$upstream_addr",'
        '"response_size":"$body_bytes_sent","upstream_response_time":"$upstream_response_time","request_time":"$request_time",'
        '"status":"$status"'
        '}';
    # ...
}

Yes, it’s not the prettiest thing in the world but it does the job. You can use any variables in the format - ones built into nginx and your own that you defined with the map directive.

I use implicit string concatenation here to make it more readable - there are multiple single-quoted strings one after another that nginx will glue together. Inside each string, I use double-quoted strings for JSON fields and values.

The escape=json option will replace non-printable chars like newlines with escaped values, e.g. \n. Quotes and backslash will be escaped too.

With this log format, you don’t need to use the grok filter in logstash and painfully parse logs into some structure. If nginx is running in kubernetes all you have to do is:

filter {
    json {
        source => "log"
        remove_field => ["log"]
    }
}

That’s because logs from containers are wrapped in a JSON object where the log message is stored in the "log" field.

Conclusion

And that’s a wrap for my nginx experience so far. I’ve written about nginx mirroring, shared a few features useful when you develop backends behind nginx, and here I’m dumping the rest of my knowledge gained while using nginx in production.