In the previous post, I shared a few things that were useful to me as a developer.
Now, wearing my “ops” hat, I want to cover a few more things - blocking bad clients, rate limiting, caching, and gradual rollout.
Blocking bad clients in nginx is usually implemented with a simple return 403 for some requests. To classify requests we can use any builtin variable, e.g. $http_user_agent to match by user agent:
server {
    # ...

    # Block all bots
    if ($http_user_agent ~ ".*bot.*") {
        return 403;
    }

    # ...
}
If you need more conditions to identify bad clients, use map to construct the final variable like this:
http {
    # Ban bots using specific API key
    map $http_user_agent:$arg_key $ban {
        ~.*bot.*:1234567890 1;
        default 0;
    }

    server {
        # ...
        if ($ban = 1) {
            return 403;
        }
        # ...
    }
}
Simple and easy. Now, let’s see more involved cases where we need to rate limit some clients.
Rate limiting allows you to throttle requests by some pattern. In nginx it is configured with two directives:

- limit_req_zone, where you describe the “zone”. A zone contains configuration on how to classify requests for rate limiting and the actual limits.
- limit_req, which applies a zone to a particular context - http for global limits, server for per-virtual-server limits, and location for a particular location in a virtual server.

To illustrate this, let’s say we need to implement the following rate limiting configuration:

- A global rate limit per client IP address.
- A limit for search engine crawlers identified by the User-Agent header.
- A limit for a few known bad clients identified by their API token.

To classify requests you need to provide a key to limit_req_zone. The key is usually some variable, either predefined by nginx or configured by you via map. All requests that share the same key value will be tracked in the zone’s hash table for rate limiting.
To set up the global rate limit by IP, we need to provide the IP as a key in limit_req_zone. Looking at the varindex of predefined variables you can see $binary_remote_addr, which we will use like this:
http {
    # ...
    limit_req_zone $binary_remote_addr zone=global:100m rate=100r/s;
    # ...
}
Heads up: if your nginx is not public, i.e. it’s behind another proxy, the remote address will be incorrectly attributed to the proxy in front of your nginx. Use the set_real_ip_from directive (together with real_ip_header from the realip module) to extract the remote address of the real client from request headers.
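As a rough sketch (assuming the proxy in front of nginx lives in 10.0.0.0/8 and passes the client address in the X-Forwarded-For header - both are placeholders for your setup), the realip configuration could look like this:

http {
    # Trust X-Forwarded-For only when the request comes from our front proxy.
    # 10.0.0.0/8 is a placeholder for the proxy's address range.
    set_real_ip_from 10.0.0.0/8;
    # Take the client address from this header.
    real_ip_header X-Forwarded-For;
    # Walk the header from right to left, skipping trusted proxies.
    real_ip_recursive on;
}

With this in place, $remote_addr and $binary_remote_addr refer to the real client, so the rate limit key above is attributed correctly.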
Now, to limit search engine crawlers by the User-Agent header we have to use map:
http {
    # ...
    map $http_user_agent $crawler {
        ~*.*(bot|spider|slurp).* $http_user_agent;
        default "";
    }

    limit_req_zone $crawler zone=crawlers:1M rate=1r/m;
    # ...
}
Here we are setting the $crawler variable to be used as the limit_req_zone key. The key in limit_req_zone must have distinct values for different clients to correctly attribute request counters. We store the real user agent value in the key, so all requests with a particular user agent will be accounted as a single stream regardless of other properties like IP address. If the request is not from a crawler, we use an empty string, which disables rate limiting for it.
Finally, to limit requests by API token we use map to create a key variable for another rate limit zone:
http {
    # ...
    map $http_authorization $badclients {
        ~.*6d96270004515a0486bb7f76196a72b40c55a47f.* 6d96270004515a0486bb7f76196a72b40c55a47f;
        ~.*956f7fd1ae68fecb2b32186415a49c316f769d75.* 956f7fd1ae68fecb2b32186415a49c316f769d75;
        default "";
    }

    # ...
    limit_req_zone $badclients zone=badclients:1M rate=1r/s;
}
Here we look into the Authorization header for an API token like Authorization: Bearer 1234567890. If it matches one of a few known tokens, we use that token as the value of the $badclients variable and then again use it as a key for limit_req_zone.
Now that we have configured three rate limit zones, we can apply them where needed. Here is the full config:
http {
    # ...

    # Global rate limit per IP.
    # Used when child context doesn't provide rate limiting configuration.
    limit_req_zone $binary_remote_addr zone=global:100m rate=100r/s;
    limit_req zone=global;

    # ...

    # Rate limit zone for crawlers
    map $http_user_agent $crawler {
        ~*.*(bot|spider|slurp).* $http_user_agent;
        default "";
    }

    limit_req_zone $crawler zone=crawlers:1M rate=1r/m;

    # Rate limit zone for bad clients
    map $http_authorization $badclients {
        ~.*6d96270004515a0486bb7f76196a72b40c55a47f.* 6d96270004515a0486bb7f76196a72b40c55a47f;
        ~.*956f7fd1ae68fecb2b32186415a49c316f769d75.* 956f7fd1ae68fecb2b32186415a49c316f769d75;
        default "";
    }

    limit_req_zone $badclients zone=badclients:1M rate=1r/s;

    server {
        listen 80;
        server_name www.example.com;

        # ...
        limit_req zone=crawlers; # Apply to all locations within www.example.com
        limit_req zone=global;   # Fallback
        # ...
    }

    server {
        listen 80;
        server_name api.example.com;

        # ...
        location /heavy/method {
            # ...
            limit_req zone=badclients; # Apply to a single location serving some heavy method
            limit_req zone=global;     # Fallback
            # ...
        }
        # ...
    }
}
Note that we had to add the global zone as a fallback wherever we have other limit_req configurations. That’s needed because nginx falls back to the limit_req directives defined in the parent context only if the current context doesn’t have any limit_req configuration of its own.
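To make this inheritance rule concrete, here is a minimal sketch (the /search location and its zone choices are hypothetical): without the repeated limit_req zone=global line, only the crawlers zone would apply inside that location.

server {
    # Applies only to locations that define no limit_req of their own.
    limit_req zone=global;

    location /search {
        # This location has its own limit_req, so nothing is inherited
        # from the server level - the global zone must be re-added explicitly.
        limit_req zone=crawlers;
        limit_req zone=global;
    }
}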
So the general pattern for configuring rate limiting is the following:

- Pick or construct (via map) a variable that classifies requests and will serve as the key.
- Describe a zone with limit_req_zone using that key, the zone size, and the rate.
- Apply the zone in the right context (http, server, or location) with limit_req.

Rate limiting will help to keep your system stable. Now let’s talk about caching, which can remove some excessive load from the backends.
One of the greatest features of nginx is its ability to cache responses.
Let’s say we are proxying requests to some backend that returns static data that is expensive to compute. We can shave load off that backend by caching its responses.
Here is how it’s done:
http {
    # ...
    proxy_cache_path /var/cache/nginx/billing keys_zone=billing:500m max_size=1000m inactive=1d;

    # ...
    server {
        # ...
        location /billing {
            proxy_pass http://billing_backend/;

            # Apply the billing cache zone
            proxy_cache billing;
            # Override default cache key. Include `Customer-Token` header to distinguish cache values per customer
            proxy_cache_key "$scheme$proxy_host$request_uri $http_customer_token";
            proxy_cache_valid 200 302 1d;
            proxy_cache_valid 404 400 10m;
        }
    }
}
In this example, we cache responses from the “billing” service that returns billing information for a client. Imagine that these requests are heavy, so we cache them per customer. We assume that clients access our billing API with the same URL but provide a Customer-Token HTTP header to distinguish themselves.
First, caching needs some place to store the values. This is configured with the proxy_cache_path directive. It needs at least two required params - the path and keys_zone. The keys_zone gives a name to the cache and sets the size of the hash table that tracks cache keys. The path will hold the actual files, named after the MD5 hash of the cache key, which by default is the full URL of the request. But you can, of course, configure your own cache key with the proxy_cache_key directive, where you can use any variables including HTTP headers and cookies.
In our case, we have overridden the default cache key by adding the $http_customer_token variable holding the value of the Customer-Token HTTP header. This way we will not poison the cache between customers.
Then, as with rate limits, you have to apply the configured cache zone to the server, location, or globally using the proxy_cache directive. In my example, I’ve applied caching for a single location.
Another important thing to configure from the start is cache invalidation. By default, only responses with 200, 301, and 302 HTTP codes are cached, and values that haven’t been accessed for 10 minutes will be deleted (the default inactive time).
Finally, when proxying requests to upstreams, nginx respects some headers like Cache-Control. If that header contains something like no-store, must-revalidate then nginx will not cache the response. To override this behavior, add proxy_ignore_headers "Cache-Control";.
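For example, to keep caching billing responses even when the backend marks them as non-cacheable, the directive can sit next to the other proxy_cache_* settings (a sketch based on the earlier /billing location):

location /billing {
    proxy_pass http://billing_backend/;
    proxy_cache billing;
    # Ignore the backend's Cache-Control (and Expires) headers so that
    # the proxy_cache_valid settings below decide the TTL instead.
    proxy_ignore_headers "Cache-Control" "Expires";
    proxy_cache_valid 200 302 1d;
    proxy_cache_valid 404 400 10m;
}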
So to configure nginx cache invalidation, do the following:

- Set max_size in proxy_cache_path to bound the amount of disk that the cache will occupy. If nginx needs to cache more than max_size, it will evict the least recently used values from the cache.
- Set the inactive param in proxy_cache_path to configure the TTL for the whole cache zone. You can override it with the proxy_cache_valid directive.
- Use the proxy_cache_valid directive to set the TTL for cache items in a given location or server.

In my example, I’ve configured caching of 200 and 302 responses for a day. And for error responses I’ve added caching for 10 minutes to avoid thrashing the backend in vain.
Another feature that is rarely used, but when it’s needed it’s a godsend, is a gradual rollout.
Imagine you are doing a massive rewrite of your product. Maybe you’re migrating to a new database system, rewriting backend in Go, or moving to a cloud. Whatever.
Your current version is used by all of the clients and you have deployed the new version alongside it. How would you switch clients from the current backend to the new one? The obvious choice is to just flip the switch and hope everything will work. But hope is not a good strategy.
You could’ve tested your new version rigorously. You might even do traffic mirroring to ensure that your new system operates correctly. But anyway, from my experience there is always something that goes wrong - a forgotten important header in the response, a slightly changed format, a rare request that swamps your DB.
I’m sure that it’s better to gradually roll out massive changes. Even a few days helps a lot. Sure, it requires more work, but it pays off.
The main feature in nginx that provides gradual rollout is the split_clients module. It works like map, but instead of setting the variable by matching some pattern, it sets the variable by splitting the source variable’s values into percentage buckets. Let me illustrate it:
http {
    upstream current {
        server backend1;
        server backend2;
    }

    upstream new {
        server newone.team.svc max_fails=0;
    }

    split_clients $arg_key $destination {
        5% new;
        *  current;
    }

    server {
        # ...
        location /api {
            proxy_pass http://$destination/;
        }
    }
}
This split_clients configuration does the following - it looks into the key query argument and for 5% of its values it sets $destination to new. For the other 95% of keys, it sets $destination to current. The way it works is that the source variable is hashed into a 32-bit hash that produces values from 0 to 4294967295, and X percent is simply the first 4294967296 * X / 100 values (for 5% it’s the first 4294967296 * 5 / 100 ≈ 214748364 values).
Just to give you a sense of how the 5% example above behaves, here is what the distribution looks like:
key | $destination
----+-------------
1 | current
2 | current
3 | current
4 | current
5 | current
6 | current
7 | current
8 | new
9 | current
10 | new
Since split_clients creates a variable, you can use it in our beloved map to construct more complex examples like this:
http {
    upstream current {
        server backend1;
        server backend2;
    }

    upstream new {
        server newone.team.svc max_fails=0;
    }

    split_clients $arg_key $new_api {
        5% 1;
        *  0;
    }

    map $new_api:$cookie_app_switch $destination {
        ~.*:1 new;
        ~0:.* current;
        ~1:.* new;
    }

    server {
        # ...
        location /api {
            proxy_pass http://$destination/;
        }
    }
}
In this example, we are combining the value from the split_clients distribution with the value of the app_switch cookie. If the cookie is set to 1, we set $destination to the new upstream. Otherwise, we look into the value from split_clients. This is a kind of feature flag to test the new system in production - everyone with the cookie set will always get responses from the new upstream.
The distribution of the keys is consistent. If you’ve used an API key for split_clients, then a user with the same API key will always be placed into the same group.
With this configuration, you can divert traffic to the new system starting with some small percentage and gradually increase it. The little downside here is that you have to change the percentage value in the config and reload nginx with nginx -s reload to apply it - there is no builtin API for that.
Now, let’s talk about nginx logging.
Collecting logs from nginx is a great idea because it’s usually the entrypoint for clients’ traffic, so it can report the actual service experience as customers see it.
To get any value from logs, they should be collected in some central place like the Elastic stack or Splunk, where you can easily query them and even build decent analytics on top. These log management tools require structured data, but nginx by default logs in the so-called “combined” log format, which is an unstructured mess that is expensive to parse.
The solution to this is simple - configure structured logging for nginx. We can do this with the log_format directive. I always log in JSON format because it’s understood universally. Here is how to configure JSON logging for nginx:
http {
    # ...
    log_format json escape=json '{'
        '"server_name": "billing-proxy",'
        '"ts":"$time_iso8601",'
        '"remote_addr":"$remote_addr","host":"$host","origin":"$http_origin","url":"$request_uri",'
        '"request_id":"$request_id","upstream":"$upstream_addr",'
        '"response_size":"$body_bytes_sent","upstream_response_time":"$upstream_response_time","request_time":"$request_time",'
        '"status":"$status"'
    '}';
    # ...
}
Yes, it’s not the prettiest thing in the world but it does the job. You can use any variables in the format - ones builtin into nginx and your own defined with the map directive.
I use implicit string concatenation here to make it more readable - there are multiple single-quoted strings one after another that nginx will glue together. Inside each string, I use double-quoted strings for JSON fields and values.
The escape=json option will replace non-printable chars like newlines with escaped values, e.g. \n. Quotes and backslashes will be escaped too.
With this log format, you don’t need to use the grok filter in Logstash and painfully parse logs into some structure. If nginx is running in Kubernetes, all you have to do is:
filter {
    json {
        source => "log"
        remove_field => ["log"]
    }
}
(This works because logs from containers are wrapped in JSON, with the log message stored in the "log" field.)
And that’s a wrap for my nginx experience so far. I’ve written about nginx mirroring, shared a few features useful when you develop backends behind nginx, and here I’m dumping the rest of my knowledge gained while using nginx in production.