Now wearing my “ops” hat, there are a few things that I wanted to cover - blocking bad clients, rate limiting, caching, and gradual rollout.
Blocking bad clients in nginx is usually implemented with a simple return 403 for some requests. To classify requests we can use any builtin variable, e.g. $http_user_agent to match by user agent:
server {
# ...
# Block all bots
if ($http_user_agent ~ ".*bot.*") {
return 403;
}
# ...
}
If you need more conditions to identify bad clients, use the map
to construct
the final variable like this:
http {
# Ban bots using specific API key
map $http_user_agent:$arg_key $ban {
~.*bot.*:1234567890 1;
default 0;
}
server {
# ...
if ($ban = 1) {
return 403;
}
# ...
}
}
Simple and easy. Now, let’s see more involved cases where we need to rate limit some clients.
Rate limiting allows you to throttle requests by some pattern. In nginx it is configured with 2 directives:
limit_req_zone, where you describe the "zone". A zone contains the configuration on how to classify requests for rate limiting and the actual limits.
limit_req, which applies a zone to a particular context - http for global limits, server for a virtual server, and location for a particular location within a virtual server.
To illustrate this, let's say we need to implement the following rate limiting configuration: a global limit per IP address, a much stricter limit for search engine crawlers identified by the User-Agent header, and a limit for a few known bad clients identified by their API token.
To classify requests you need to provide a key to limit_req_zone. The key is usually some variable, either predefined by nginx or constructed by you via map. All requests that share the same key value will be tracked in the zone's hash table for rate limiting.
To set up the global rate limit by IP, we need to provide the IP address as the key in limit_req_zone. Looking at the varindex of predefined variables, you can see $binary_remote_addr, which we will use like this:
http {
# ...
limit_req_zone $binary_remote_addr zone=global:100m rate=100r/s;
# ...
}
Heads up: if your nginx is not public, i.e. it’s behind another proxy, the
remote address will be incorrectly attributed to the proxy before your nginx.
Use the set_real_ip_from and real_ip_header directives from the realip module to extract the real client address from request headers.
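For example, if requests reach nginx through a load balancer in 10.0.0.0/8 (the network and the header name here are assumptions - adjust them to your setup), the configuration could look roughly like this:
server {
    # ...
    # Trust the X-Forwarded-For header set by proxies from this network
    set_real_ip_from 10.0.0.0/8;
    real_ip_header X-Forwarded-For;
    real_ip_recursive on;
    # ...
}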
Now, to limit search engine crawlers by User-Agent
header we have to use
map
:
http {
# ...
map $http_user_agent $crawler {
~*.*(bot|spider|slurp).* $http_user_agent;
default "";
}
limit_req_zone $crawler zone=crawlers:1M rate=1r/m;
# ...
}
Here we are setting $crawler
variable as a limit_req_zone
key. The key
in
limit_req_zone
must have distinct values for different clients to correctly
attribute request counters. We store the real user agent value for keys, so all
requests with a particular user agent will be accounted as a single stream
regardless of other properties like IP address. If the request is not from a
crawler we use an empty string which disables rate limiting.
Finally, to limit requests by API token we use map
to create a key
variable
for another rate limit zone:
http {
# ...
map $http_authorization $badclients {
~.*6d96270004515a0486bb7f76196a72b40c55a47f.* 6d96270004515a0486bb7f76196a72b40c55a47f;
~.*956f7fd1ae68fecb2b32186415a49c316f769d75.* 956f7fd1ae68fecb2b32186415a49c316f769d75;
default "";
}
# ...
limit_req_zone $badclients zone=badclients:1M rate=1r/s;
}
Here we look into the Authorization header for an API token like Authorization: Bearer 1234567890. If it matches one of a few known tokens, we use that token as the value of the $badclients variable and then use it as the key for limit_req_zone.
Now that we have configured 3 rate limit zones, we can apply them where needed. Here is the full config:
http {
# ...
# Global rate limit per IP.
# Used when child context doesn't provide rate limiting configuration.
limit_req_zone $binary_remote_addr zone=global:100m rate=100r/s;
limit_req zone=global;
# ...
# Rate limit zone for crawlers
map $http_user_agent $crawler {
~*.*(bot|spider|slurp).* $http_user_agent;
default "";
}
limit_req_zone $crawler zone=crawlers:1M rate=1r/m;
# Rate limit zone for bad clients
map $http_authorization $badclients {
~.*6d96270004515a0486bb7f76196a72b40c55a47f.* 6d96270004515a0486bb7f76196a72b40c55a47f;
~.*956f7fd1ae68fecb2b32186415a49c316f769d75.* 956f7fd1ae68fecb2b32186415a49c316f769d75;
default "";
}
limit_req_zone $badclients zone=badclients:1M rate=1r/s;
server {
listen 80;
server_name www.example.com;
# ...
limit_req zone=crawlers; # Apply to all locations within www.example.com
limit_req zone=global; # Fallback
# ...
}
server {
listen 80;
server_name api.example.com;
# ...
location /heavy/method {
# ...
limit_req zone=badclients; # Apply to a single location serving some heavy method
limit_req zone=global; # Fallback
# ...
}
# ...
}
}
Note that we had to add the global zone as a fallback wherever we have other limit_req configurations. That's needed because nginx falls back to the limit_req defined in the parent context only if the current context doesn't have any limit_req configuration of its own.
So the general pattern for configuring rate limiting is: construct a key that identifies the clients you want to throttle (a builtin variable or one you build with map), describe a zone for that key with limit_req_zone, and apply the zone in the right context with limit_req.
Rate limiting will help keep your system stable. Now let's talk about caching, which can remove some excessive load from the backends.
One of the greatest features of nginx is its ability to cache responses.
Let’s say we are proxying requests to some backend that returns static data that is expensive to compute. We can shave the load from that backend by caching its response.
Here is how it’s done:
http {
# ...
proxy_cache_path /var/cache/nginx/billing keys_zone=billing:500m max_size=1000m inactive=1d;
# ...
server {
# ...
location /billing {
proxy_pass http://billing_backend/;
# Apply the billing cache zone
proxy_cache billing;
# Override default cache key. Include `Customer-Token` header to distinguish cache values per customer
proxy_cache_key "$scheme$proxy_host$request_uri $http_customer_token";
proxy_cache_valid 200 302 1d;
proxy_cache_valid 404 400 10m;
}
}
}
In this example, we cache responses from the “billing” service that returns
billing information for a client. Imagine that these requests are heavy so we
cache them per customer. We assume that clients access our billing API with the
same URL but provide a Customer-Token HTTP header to distinguish themselves.
First, caching needs some place where it will store the values. This is
configured with the proxy_cache_path
directive. It needs at least 2 required params - keys_zone
and path. The
keys_zone
gives a name to the cache and sets the size of the hash table to
track cache keys. The path will hold the actual files named after the MD5 hash of the cache key, which by default is the full URL of the request. But you can, of
course, configure your own cache key with the proxy_cache_key
directive where
you can use any variables including HTTP headers and cookies.
In our case, we have overridden the default cache key by adding the
$http_customer_token
variable holding the value of the Customer-Token
HTTP
header. This way we will not poison the cache between customers.
Then, as with rate limits, you have to apply the configured cache zone to the
server, location, or globally using proxy_cache
directive. In my example, I’ve applied caching for a single location.
Another important thing to configure from the start is cache invalidation. By default, only responses with 200, 301, and 302 HTTP codes are cached, and values older than 10 minutes will be deleted.
Finally, when proxying requests to upstreams, nginx respects some headers like
Cache-Control
. If that header contains something like no-store, must-revalidate
then nginx will not cache the response. To override this
behavior add proxy_ignore_headers "Cache-Control";
.
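For example, to cache the billing responses even when the backend sends such a header, the location from the example above could be extended roughly like this (whether ignoring the backend's caching hints is acceptable is, of course, your call):
location /billing {
    proxy_pass http://billing_backend/;
    proxy_cache billing;
    # Cache according to proxy_cache_valid, ignoring the backend's caching hints
    proxy_ignore_headers "Cache-Control";
    proxy_cache_valid 200 302 1d;
}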
So to configure nginx cache invalidation do the following:
Set max_size in proxy_cache_path to bound the amount of disk space the cache will occupy. If nginx needs to cache more than max_size, it will evict the least recently used values from the cache.
Set the inactive param in proxy_cache_path to configure the TTL for the whole cache zone.
Use the proxy_cache_valid directive to set the TTL for cache items in a given location or server, per response code - it overrides the zone-wide setting for those items.
In my example, I've configured caching of 200 and 302 responses for a day. I've also added 10-minute caching for error responses to avoid thrashing the backend in vain.
Another feature that is rarely used, but when it’s needed it’s a godsend, is a gradual rollout.
Imagine you are doing a massive rewrite of your product. Maybe you’re migrating to a new database system, rewriting backend in Go, or moving to a cloud. Whatever.
Your current version is used by all of the clients and you have deployed the new version alongside it. How would you switch clients from the current backend to the new one? The obvious choice is to just flip the switch and hope everything will work. But hope is not a good strategy.
You could’ve tested your new version rigorously. You might even do the traffic mirroring to ensure that your new system operates correctly. But anyway, from my experience there is always something that goes wrong - forgotten important header in the response, slightly changed format, rare request that swamps your DB.
I'm sure that it's better to gradually roll out massive changes. Even a few days helps a lot. Sure, it requires more work, but it pays off.
The main feature in nginx that enables gradual rollout is the split_clients module. It works like map, but instead of setting a variable by matching a pattern, it sets the variable based on the distribution of the source variable's values. Let me illustrate it:
http {
upstream current {
server backend1;
server backend2;
}
upstream new {
server newone.team.svc max_fails=0;
}
split_clients $arg_key $destination {
5% new;
* current;
}
server {
# ...
location /api {
proxy_pass http://$destination/;
}
}
}
This split_clients configuration does the following - it looks into the key query argument and for 5% of its values it sets $destination to new. For the other 95% of keys, it sets $destination to current. The way it works is that the source variable is hashed into a 32-bit value from 0 to 4294967295, and the X percent group is simply the first 4294967296 * X / 100 values (for 5% that's the first 4294967296 * 5 / 100 ≈ 214748364 values).
Just to give you a sense of how the 5% example above behaves, here is what the distribution looks like:
key | $destination
----+-------------
1 | current
2 | current
3 | current
4 | current
5 | current
6 | current
7 | current
8 | new
9 | current
10 | new
Since split_clients creates a variable, you can use it in our beloved map to construct more complex examples like this:
http {
upstream current {
server backend1;
server backend2;
}
upstream new {
server newone.team.svc max_fails=0;
}
split_clients $arg_key $new_api {
5% 1;
* 0;
}
map $new_api:$cookie_app_switch $destination {
~.*:1 new;
~0:.* current;
~1:.* new;
}
server {
# ...
location /api {
proxy_pass http://$destination/;
}
}
}
In this example, we are combining the value from the split_clients
distribution with the value of the app_switch
cookie. If the cookie is set to
1, we set $destination
to new
upstream. Otherwise, we look into the value from
split_clients
. This is a kind of feature flag to test the new system in
production - everyone with the cookie set will always get responses from the
new
upstream.
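To try the new backend yourself, you can simply send the cookie by hand, e.g. (the hostname here is just a placeholder):
curl -H "Cookie: app_switch=1" http://api.example.com/api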
The distribution of the keys is consistent. If you use the API key as the source for split_clients, a user with the same API key will always be placed into the same group.
With this configuration, you can divert traffic to the new system starting with some small percentage and gradually increase it. The little downside here is that you have to change the percentage value in the config and reload nginx with nginx -s reload to apply it - there is no builtin API for that.
Now, let’s talk about nginx logging.
Collecting logs from nginx is a great idea because it's usually the entry point for client traffic, so it can report the actual service experience as customers see it.
To get any value from logs, they should be collected in some central place like the Elastic stack or Splunk, where you can easily query them and even build decent analytics. These log management tools require structured data, but nginx by default logs in the so-called "combined" log format, which is an unstructured mess that is expensive to parse.
The solution to this is simple - configure structured logging for nginx. We can
do this with the log_format
directive. I always log in JSON format because it’s understood universally. Here
is how to configure JSON logging for nginx:
http {
# ...
log_format json escape=json '{'
'"server_name": "billing-proxy",'
'"ts":"$time_iso8601",'
'"remote_addr":"$remote_addr","host":"$host","origin":"$http_origin","url":"$request_uri",'
'"request_id":"$request_id","upstream":"$upstream_addr",'
'"response_size":"$body_bytes_sent","upstream_response_time":"$upstream_response_time","request_time":"$request_time",'
'"status":"$status"'
'}';
# ...
}
Yes, it's not the prettiest thing in the world but it does the job. You can use any variables in the format - those built into nginx and your own defined with the map directive.
I use implicit string concatenation here to make it more readable - there are multiple single-quoted strings one after another that nginx will glue together. Inside each string, I use double-quoted strings for JSON fields and values.
The escape=json
option will replace non-printable chars like newlines with
escaped values, e.g. \n
. Quotes and backslash will be escaped too.
With this log format, you don’t need to use the grok
filter in logstash and
painfully parse logs into some structure. If nginx is running in kubernetes all
you have to do is:
filter {
json {
source => "log"
remove_field => ["log"]
}
}
That's because logs from containers are wrapped in JSON, where the log message is stored in the "log" field.
And that’s a wrap for my nginx experience so far. I’ve written about nginx mirroring, shared a few features useful when you develop backends behind nginx and here I’m dumping the rest of my knowledge gained while using nginx in production.
First, let's look at the simple config that just forwards requests from http://proxy.local/ to a single backend at http://backend.local:10000.
user nginx;
worker_processes auto;
events {}
http {
access_log /var/log/nginx/access.log combined;
# include /etc/nginx/conf.d/*.conf;
upstream backend {
server backend.local:10000;
}
server {
server_name proxy.local;
listen 8000;
location / {
proxy_pass http://backend;
}
}
}
You declare your backend service as an upstream
group. Each instance of the
backend is described with a server
directive.
Then you declare entrypoint with a server
and location
. Given that it’s
nginx you can go crazy with regexp location matching and stuff but it’s not what
is required in the case of service routing.
Finally, you forward requests with proxy_pass
directive.
From this simple config, we can start to build the necessary complexity.
If your service needs active/passive configuration where one server is the main for requests handling and the other is a backup then you can configure it like this:
...
upstream backend {
server main-backend.local:10000;
server backup-backend.local:10000 backup;
}
...
The backup option tells nginx that this server in the upstream group will be used only if the primary server is unavailable.
By default, a server is marked as unavailable after 1 connection error or timeout.
This can be tuned with max_fails
option for each server in an upstream group
like this:
...
upstream backend {
# Try 3 times for the main server
server main-backend.local:10000 max_fails=3;
# Try 10 times for backup server
server backup-backend.local:10000 backup max_fails=10;
}
...
In addition to connection errors and timeouts, you can treat various HTTP error codes like 500 as unsuccessful attempts. This is configured with the proxy_next_upstream directive.
...
upstream backend {
server main-backend.local:10000;
server backup-backend.local:10000 backup;
}
server {
server_name proxy.local;
listen 8000;
# Switch to the next upstream in case of connection error, timeout
# or HTTP 429 error (rate limit).
proxy_next_upstream error timeout http_429;
location / {
proxy_pass http://backend;
}
}
...
The max_fails option is crucial if your nginx is running inside Kubernetes and you want to proxy requests to a Kubernetes service (using cluster DNS). In this case, you should have a single server with max_fails=0 like this:
...
upstream backend {
server app.my-team.svc max_fails=0;
}
...
This way nginx will not mark the Kubernetes service as unavailable and won't try to do passive health checks. None of that is needed because the Kubernetes service already does active health checks itself with readiness probes.
map
Sometimes you need to route requests based on some header value. Or query parameter. Or cookie value. Or hostname. Or any combination of those.
And this is the case where nginx really shines. It's the only proxy server (in my experience) that allows request routing with almost arbitrary logic.
The key part that makes this possible is
ngx_http_map_module
.
This module allows you to define a variable from a combination of other variables using regular expressions. Sounds complicated, but wait for it.
Say, we have 3 backend services that serve different kinds of data - live data, historical data for a particular date, and aggregated counters. Call it microservices architecture, whatever.
These services are exposed to users via the same endpoint https://<date>.api.com/?report=<report> - the optional date subdomain selects the day of historical data, and the report query parameter selects the kind of report, with report=counters served by the aggregation service.
This may seem like an ugly API but this is how the real world often looks and you have to deal with it.
So let’s write a routing configuration. First, define 3 upstream groups:
upstream live {
server live-backend-1:8000;
server live-backend-2:8000;
server live-backend-3:8000;
}
upstream hist {
server hist-backend-1:9999;
server hist-backend-2:9999;
}
upstream agg {
server agg-backend-1:7100;
server agg-backend-2:7100;
server agg-backend-3:7100;
}
Next, define the server that will listen for all requests and somehow route them:
server {
server_name *.api.com "";
listen 80;
location / {
# FIXME: proxy pass to who?
proxy_pass http://???;
}
}
The question is what should we write in proxy_pass
directive?
Since nginx configuration is declarative we can write proxy_pass http://$destination/
and build the destination variable with maps.
In our example service, we make a routing decision based on the report
query
variable and date subdomain. This is what we need to extract into our variables:
map $host $date {
"~^((?<subdomain>\d{4}-\d{2}-\d{2}).)?api.com$" $subdomain;
default "";
}
Map will parse $host
variable (one of the many predefined nginx variables) and
set the result of parsing into our $date
variable. Inside the map, there are
parsing rules.
In my case there are 2 rules - the main one with regex and the other is a
fallback denoted with the default
keyword.
You can inspect the regex in regex101. The
first symbol ~
marks the rule as a regular expression. Our regex starts with
^
and ends with $
which denote the start and end of the line - it's a best practice for regexes to explicitly match the whole string and I use it as much as possible. To extract the subdomain we create a group with parentheses. Inside that group I use \d{4}-\d{2}-\d{2} to parse the date format 2021-05-01. There is also the ?<subdomain> thing inside the group. This is called a named capture group and it gives a name to the matched part of the regex. The capture group is then used on the right side of the map rule to assign its value to the $date variable. Note that the subdomain is optional, so we need to wrap it in parentheses together with the dot (the subdomain delimiter) and add ? to the whole group.
Phew! The regex part is done so we may relax.
To extract the report we don't need a map because nginx provides the $arg_<param> predefined variables for query parameters. So the report query parameter can be accessed as $arg_report.
The full list of nginx variables can be googled with “nginx varindex” and is located here.
Ok, so now we have the date and report. How can we construct $destination
variable from it? With another map! The trick here is that you can use a
combination of variables to create the new variable in the map:
map "$arg_report:$date" $destination {
"~counters:.*" agg;
"~.*:.+" hist;
default live;
}
The combination here is a string where 2 variables are joined with a colon. Colon is a personal choice and used for convenience. You can use any symbol, just make sure that regex will be unambiguous.
In the map, we have 3 rules: set $destination to agg when the report query parameter is counters; set $destination to hist when the $date variable is not empty; and default $destination to live. Regexes in the map are evaluated in order.
Note that $destination
value is the name of the upstream group.
Here is the full config:
events {}
http {
upstream live {
server live-backend-1:8000;
server live-backend-2:8000;
server live-backend-3:8000;
}
upstream hist {
server hist-backend-1:9999;
server hist-backend-2:9999;
}
upstream agg {
server agg-backend-1:7100;
server agg-backend-2:7100;
server agg-backend-3:7100;
}
map $host $date {
"~^((?<subdomain>\d{4}-\d{2}-\d{2}).)?api.local$" $subdomain;
default "";
}
map "$arg_report:$date" $destination {
"~counters:.*" agg;
"~.*:.+" hist;
default live;
}
server {
server_name *.api.com "";
listen 80;
location / {
proxy_pass http://$destination/;
}
}
}
If you use Consul for service discovery then your services can be accessed via
DNS provided by Consul. It’s as simple as curl myapp.service.consul
.
Very convenient, but by default nothing knows how to resolve names in the .consul zone. The Consul docs give a few ways to configure it universally in your infrastructure.
I’ve used dnsmasq with great success.
Anyway, to route requests in nginx via Consul DNS you don't have to go to great lengths.
There is a resolver
directive in nginx for using custom DNS servers.
Here is how to forward requests via Consul DNS from nginx:
...
server {
server_name *.api.com "";
listen 80;
# Resolve using Consul DNS. Fallback to Google and Cloudflare DNS.
resolver 10.0.0.1:8600 10.0.0.2:8600 10.0.0.3:8600 8.8.8.8 1.1.1.1;
location /v1/api {
proxy_pass http://prod.api.service.consul/;
}
location /v1/rpc {
proxy_pass http://prod.rpc.service.consul/;
}
}
...
Update: Nice people at lobste.rs pointed
out
that proxy_pass
caches DNS response until restart. There are a few ways to fix
this. First, put the Consul service URL into the upstream and use valid
option
in resolver
directive
for tuning DNS response TTL. The other option is to use a variable for
proxy_pass
as described by Jeppe Fihl-Pearson
here. Apparently, when nginx
sees a variable in proxy_pass
it will honor the TTL of DNS response.
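A sketch of that second approach could look like this (the resolver address is taken from the example above, and the 30s TTL cap is an arbitrary choice):
server {
    # ...
    resolver 10.0.0.1:8600 valid=30s;
    location /v1/api {
        # A variable in proxy_pass makes nginx re-resolve the name, honoring the DNS TTL.
        # Note: with a variable, nginx passes the original request URI unchanged.
        set $api_backend http://prod.api.service.consul;
        proxy_pass $api_backend;
    }
}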
Yes, it’s not dynamic in the way that traefik does it. If a new service needs to be added you have to edit the nginx config somehow while traefik does this automatically.
But you can implement decent service discovery using consul template that will update nginx config from consul data.
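A minimal consul-template snippet for that could look roughly like this (the service name is an assumption); consul-template re-renders the file when the catalog changes and can run nginx -s reload for you via its command option:
upstream api {
{{ range service "prod.api" }}
    server {{ .Address }}:{{ .Port }};
{{ end }}
}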
Nginx is a very versatile tool. It has a rich configuration language that enables nice features for developers.
Yes, it’s not perfect - the upstream healthchecks are passive (in the open source version), configuration defaults are not modern, initial setup is rough.
But given all the richness, investing a little bit of time into it is worth it. Before ditching it in favor of something else, think hard about all the features that nginx provides.
TASK [Some long command like backup job] ***************************
task path: /home/avd/src/ansible/playbook.yml:4
fatal: [localhost]: FAILED! => {
"changed": false,
"msg": "check mode and async cannot be used on same task."
}
I often see it because I check every playbook that I run with “check mode”.
Check mode in Ansible does everything described in the task except actually executing it. It's like --dry-run in svn if you remember those things.
Most of the time check mode works, but when async mode is enabled it fails with the above error. Async tasks are the ones that run for a long time, and when such a job fails in the middle after a few hours because a variable was rendered incorrectly, it is very frustrating.
So what if you really need to check async task?
Today, I found a way to do this:
async: "{{ ansible_check_mode | ternary(0, 21600) }}"
This little trick checks for check mode, and if it's enabled, async is disabled because it's set to 0. If check mode is not enabled, it sets the desired async timeout.
Here is an example playbook with this trick applied:
---
- hosts: localhost
tasks:
- name: Some long command like backup job
command: >-
echo "/usr/local/bin/backup-job {{ date }} {{ destination }}"
async: "{{ ansible_check_mode | ternary(0, 10800) }}"
Run it and see your check mode stuff:
$ ansible-playbook -C -vvv playbook.yml -e date='2020-09-25' -e destination='s3://mybucket/backups/'
ansible-playbook 2.9.13
...
PLAYBOOK: playbook.yml **********************************************************************************
1 plays in playbook.yml
PLAY [localhost] ****************************************************************************************
TASK [Gathering Facts] **********************************************************************************
task path: /home/avd/src/ansible/playbook.yml:2
...
TASK [Some long command like backup job] ****************************************************************
task path: /home/avd/src/ansible/playbook.yml:4
...
skipping: [localhost] => {
"changed": false,
"invocation": {
"module_args": {
"_raw_params": "echo \"/usr/local/bin/backup-job 2020-09-25 s3://mybucket/backups/\"",
"_uses_shell": false,
"argv": null,
"chdir": null,
"creates": null,
"executable": null,
"removes": null,
"stdin": null,
"stdin_add_newline": true,
"strip_empty_ends": true,
"warn": true
}
},
"msg": "skipped, running in check mode"
}
META: ran handlers
META: ran handlers
PLAY RECAP **********************************************************************************************
localhost : ok=1 changed=0 unreachable=0 failed=0 skipped=1 rescued=0 ignored=0
Redis is an indispensable tool for many software engineering problems because it provides great primitives, it's fast and solid. Most of the time it's used as some sort of cache. But if you stretch it to other use cases its behavior may surprise you.
Recently we tried to use it as persistent storage for a large dataset. We ran into a lot of problems, fixed many of them, and gained a lot of experience that I want to share. So here is my experience report.
Disclaimer – all of these problems arose from our use case and not because Redis is somehow flawed. Like any piece of software it requires understanding and research before being deployed in any decent production environment.
We have a data collecting pipeline with the following requirements:
Given our requirements we started to use Redis cluster from the start. We chose
it over single master/replica because we couldn’t fit our 800M+ keys on a single
instance and because Redis cluster provides high availability kinda
out of the box (you still need to create the cluster with redis-trib.rb
or
redis-cli --cluster create
). Also, such beefy nodes are very hard to manage – loading of the dataset would take about an hour, and a snapshot would take a long time.
So, I set up a Redis cluster, and this time I did it without cross replication because I used Google Cloud instances and because cross replication is very tedious to configure and painful to maintain.
Now, it’s time to load the data.
The naive way of loading data by sending millions of SET commands is very inefficient because you’ll spend most of the time waiting for command RTT. Instead, you should use pipelining or even generate a file with Redis protocol for mass insert.
I have experience with pipelining and would recommend this way because it allows you to control the process and anyway it’s much more convenient than generating text files.
With pipelining I saw more than 300K RPS on insert (SET/HSET/SADD) so it’s very performant. But it has one crucial point regarding the Redis cluster mode – multi-key commands must hit the same node. That’s understandable because all commands in a pipeline are seen as one and to generate the response you don’t need to gather data from other nodes (potentially failing) but instead do everything in a single process context.
Nevertheless, it’s possible to use pipelining with Redis cluster – you just have to use hash tags. Hash tags are a substring in curly braces that Redis will use for calculating the hash slot and consequently determine the cluster node. It looks like this:
SET {shard}:key
{shard}
is a hash tag.
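Here is roughly what pipelined loading with a hash tag looks like in Python with redis-py - a sketch that assumes you connect to the cluster node owning the {analytics} hash slot (or use a cluster-aware client):
import random
import redis

r = redis.Redis(host="10.0.0.1", port=6379)

pipe = r.pipeline(transaction=False)
for i in range(1_000_000):
    # All keys share the {analytics} hash tag, so they map to the same hash
    # slot and the whole pipeline executes on a single cluster node.
    pipe.set(f"{{analytics}}:key:{i}", random.randint(0, 100_000_000))
    if i % 10_000 == 0:
        pipe.execute()  # flush the batch instead of buffering everything
pipe.execute()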
All operations in a pipeline must have the same hash tag to succeed. But the problem here is that all keys with the same hash tag will be on the same node in the same hash slot. This leads to uneven data distribution and imbalanced memory consumption across Redis cluster nodes. In our use case data partitions were very different in size, and after the data loading we got a 3x discrepancy in memory consumption between some nodes. This is a problem because cluster nodes end up with very different utilization and it becomes hard to size the cluster.
It’s possible to rebalance your cluster by moving hash slots between nodes –
it’s described in the cluster tutorial. I’ve tried
the process described in CLUSTER SETSLOT
doc. But I would recommend against this
because it's a manual, error-prone process, you will forget about it the next time you need to set up the cluster, and essentially it's a dirty fix.
So we started to use Redis cluster, load the data with pipelining and use hash tags to make pipelining work.
Let's talk about memory consumption, because Redis is an in-memory database, meaning that your dataset is bound by the amount of memory on the Redis server node. And you can't just count the size of your data for capacity planning - you have to remember that storing any Redis key is not free. The main hash table (used for SET) and all Redis datatypes like sets and lists have overhead.
We can see that overhead with a MEMORY USAGE
command.
127.0.0.1:6379> mget 0 1000 100000
1) "76876987"
2) "76184956"
3) "74602210"
127.0.0.1:6379> MEMORY USAGE 0
(integer) 43
127.0.0.1:6379> MEMORY USAGE 1000
(integer) 46
127.0.0.1:6379> MEMORY USAGE 100000
(integer) 48
127.0.0.1:6379> DEBUG OBJECT 0
Value at:0x7f21c8ab95e0 refcount:1 encoding:int serializedlength:5 lru:16680050 lru_seconds_idle:103
Serialized length of the value is 5 while real memory usage is 43, so a single simple key storing nothing but single integer value has overhead of almost 40 bytes.
This overhead is needed not only for making hash table work but also for various features that Redis provides to you like efficient memory encoding and LRU keys eviction.
If you want to store keys with expiration (i.e. TTL) prepare for a 50% increase in memory consumption.
Let’s conduct a simple experiment – load 1 million keys without TTL and then compare memory usage with 1 million keys with TTL.
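The loader script itself is not shown in this post; here is a minimal sketch of what such a loader.py could look like (the key layout and the 1-hour TTL are assumptions):
import argparse
import random
import redis

parser = argparse.ArgumentParser()
parser.add_argument("--expire", action="store_true", help="set a TTL on every key")
args = parser.parse_args()

r = redis.Redis(host="127.0.0.1", port=6379)
pipe = r.pipeline(transaction=False)
for i in range(1_000_000):
    # Each key holds a single random integer, optionally with a 1-hour TTL
    pipe.set(str(i), random.randint(0, 100_000_000), ex=3600 if args.expire else None)
    if i % 10_000 == 0:
        pipe.execute()
pipe.execute()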
Here is the initial state with empty redis.
$ redis-cli
127.0.0.1:6379> dbsize
(integer) 0
127.0.0.1:6379> INFO memory
# Memory
used_memory:853328
used_memory_human:833.33K
used_memory_rss:5955584
used_memory_rss_human:5.68M
used_memory_peak:853328
used_memory_peak_human:833.33K
used_memory_peak_perc:100.01%
used_memory_overhead:841102
used_memory_startup:791408
used_memory_dataset:12226
used_memory_dataset_perc:19.74%
...
Load 1 million keys each containing a single random integer:
$ python3 loader.py
$ redis-cli
127.0.0.1:6379> dbsize
(integer) 1000000
127.0.0.1:6379> info memory
# Memory
used_memory:57240464
used_memory_human:54.59M
used_memory_rss:62619648
used_memory_rss_human:59.72M
used_memory_peak:57240464
used_memory_peak_human:54.59M
used_memory_peak_perc:100.00%
used_memory_overhead:49229710
used_memory_startup:791408
used_memory_dataset:8010754
used_memory_dataset_perc:14.19%
...
Memory usage is 59.72M.
Now let’s load 1 million keys with expire:
$ python3 loader.py --expire
$ redis-cli
127.0.0.1:6379> dbsize
(integer) 1000000
127.0.0.1:6379> info memory
# Memory
used_memory:89628800
used_memory_human:85.48M
used_memory_rss:95326208
used_memory_rss_human:90.91M
used_memory_peak:89628800
used_memory_peak_human:85.48M
used_memory_peak_perc:100.00%
used_memory_overhead:81618318
used_memory_startup:791408
used_memory_dataset:8010482
used_memory_dataset_perc:9.02%
...
Memory consumption grew 52% to 90.91M.
Redis expires add a lot of additional overhead because, as far as I can tell, they are stored as separate keys in an internal hash table (db->expires).
/* Set an expire to the specified key. If the expire is set in the context
* of an user calling a command 'c' is the client, otherwise 'c' is set
* to NULL. The 'when' parameter is the absolute unix time in milliseconds
* after which the key will no longer be considered valid. */
void setExpire(client *c, redisDb *db, robj *key, long long when) {
dictEntry *kde, *de;
/* Reuse the sds from the main dict in the expire dict */
kde = dictFind(db->dict,key->ptr);
serverAssertWithInfo(NULL,key,kde != NULL);
de = dictAddOrFind(db->expires,dictGetKey(kde));
dictSetSignedIntegerVal(de,when);
int writable_slave = server.masterhost && server.repl_slave_ro == 0;
if (c && writable_slave && !(c->flags & CLIENT_MASTER))
rememberSlaveKeyWithExpire(db,key);
}
By the way, this is the entire function. Redis code is very readable once you get used to the camel case in C.
Once we started to load the data into our Redis cluster, the memory consumption was too damn high! With our imbalanced cluster we had to use n1-highmem-16 nodes, which are quite expensive, to be able to fit our largest shard.
So we needed to reduce our memory consumption. And the only way to do this without (almost) any modification to the data is to use Redis hashes.
One of the nicest tricks to reduce memory consumption is to store values in small Redis hashes instead of the main hash table. This will work because of ziplist optimization in Redis.
In short, with this optimization Redis stores hash values in arrays of configurable size. You avoid hash table overhead but give up lookup speed which is amortized over time because of the small size of the array.
Folks at Instagram used it and we also tried it and shaved off a considerable amount of memory.
But remember that you can't just shove your values into a hash and call it done. To trigger the ziplist optimization you need to bucket your keys into hashes no bigger than the ziplist size. Also, with hashes you lose some features - the most important one is expires.
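A sketch of the bucketing idea in Python (the bucket size of 1000 is an assumption - it must stay below your hash-max-ziplist-entries setting):
import redis

r = redis.Redis(host="127.0.0.1", port=6379)

def hset_bucketed(key_id, value):
    # Spread keys over many small hashes so each hash stays within ziplist limits
    r.hset(f"bucket:{key_id // 1000}", key_id % 1000, value)

def hget_bucketed(key_id):
    return r.hget(f"bucket:{key_id // 1000}", key_id % 1000)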
So we started to store our dataset in Redis hashes to reduce memory consumption and use smaller instance types for our imbalanced cluster.
Finally, we wanted to use persistence because our dataset was important – we cannot lose it because it would lead to the data pipeline downtime and, while we can regenerate all of the data, it takes a lot of time to load.
The key lesson here is that if you want to use persistence in Redis with a lot of data – you have a problem.
It all boils down to the, again, memory consumption that is quickly growing during snapshotting. But first, let’s quickly recall how persistence works.
There are 2 persistence options in Redis – RDB snapshots and AOF log. With RDB snapshots Redis periodically makes snapshot of the in-memory data by forking the main process and writing data in a child process. It works because of Copy-on-Write feature in modern operating systems where parent and child processes can share the memory without doubling the data unless memory is not modified in the parent process. When memory gets written in the parent process the operating system will make a copy for the child so it will see the old version – that’s why it’s called Copy-on-Write.
When RDB snapshotting is performed it should be free in terms of memory consumption because of CoW, but it's more subtle. If new data is being written during snapshotting, then memory consumption will grow by the size of that new data because Copy-on-Write will trigger the creation of new memory pages. The longer your snapshot process, the more likely it will hit you. And the more data you write during this process, the more your memory consumption will grow.
With the default configuration a snapshot will be taken every 10000 changes, which in our case means constantly during data upload. We were uploading data in huge batches, so our memory consumption almost doubled and eventually Redis was OOM killed.
So we tried to use AOF instead of RDB. But when AOF log is rewritten it uses the same Copy-on-Write trick as RDB snapshots so we get OOM killed again.
There are a few possible fixes for this. First, you can simply disable persistence if it fits your case. For example, if you can lose or quickly recover your data.
You can also have 2x memory to accommodate extra writes during snapshotting.
And you can also control snapshotting by issuing a manual BGSAVE or BGREWRITEAOF. But this won't help you when a replica is syncing from the master. This is the most surprising thing I saw with Redis – when a replica crashes and restarts, it needs to sync with the master. Syncing with the master is performed by triggering an RDB snapshot and sending it over the network. So even if persistence is completely disabled, Redis may trigger RDB snapshotting for replica sync, with all the consequences like increased memory consumption and the risk of being killed by OOM. And as far as I know, you cannot disable it.
In our case we settled on the manual BGSAVE via cron once a day when the data most likely won’t be uploaded.
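The crontab entry for that can be as simple as (the exact time is an assumption - pick your own idle window):
0 4 * * * redis-cli -h 127.0.0.1 -p 6379 BGSAVE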
At the end of this journey we had a Redis cluster for our simple aggregated data. We loaded data via Redis pipelined commands so we used hash tags. To reduce memory consumption we used Redis hashes. And for persistence we have a cron job that will trigger BGSAVE in idle time.
This is my third post on Redis – I’ve also written on high availability options and cross-replicated cluster.
Working on this use case taught me a lot about Redis – how it works and where it fits well or not – and I got a much better understanding of it, which is the most important thing for software engineers.
As always if you have any comments or suggestions feel free to send me an email. That’s it for now, subscribe via RSS/Atom feed to stay tuned for the next post. Till the next time!
That's all nice and dandy but when I started to use it I was struggling because there are no built-in alerts coming with Prometheus. Looking on the Internet, though, I've found the following alert examples:
From my point of view, the lack of ready-to-use examples is a major pain for anyone who is starting to use Prometheus. Fortunately, the community is aware of that and working on various proposals:
All of this seems great but we are not there yet, so here is my humble attempt to add more examples to the sources above. I hope it will help you get started with Prometheus and Alertmanager.
Before you start setting up alerts you must have metrics in the Prometheus time-series database. There are various exporters for Prometheus that expose various metrics, but I will show you examples for node_exporter, redis_exporter, jmx-exporter (for Kafka and Zookeeper), and consul_exporter.
All of the exporters are very easy to set up except JMX, because the latter should be run as a Java agent within the Kafka/Zookeeper JVM. Refer to my previous post on setting up jmx-exporter.
After setting up all the needed exporters and collecting the metrics for some time we can start crafting our alerts.
My philosophy for alerting is pretty simple – alert only when something is really broken, include maximum info and deliver via multiple media.
You describe the alerts in alert.rules
file (usually in /etc/prometheus
) on
Prometheus server, not Alertmanager, because the latter is responsible for
formatting and delivering alerts.
The format of alert.rules is YAML and it goes like this:
groups:
- name: Hardware alerts
rules:
- alert: Node down
expr: up{job="node_exporter"} == 0
for: 3m
labels:
severity: warning
annotations:
title: Node {{ $labels.instance }} is down
description: Failed to scrape {{ $labels.job }} on {{ $labels.instance }} for more than 3 minutes. Node seems down.
You have a top-level groups key that contains a list of groups. I usually create a group for each exporter, so I have Hardware alerts for node_exporter, Redis alerts for redis_exporter and so on.
Also, all of my alerts have 2 annotations – title and description that will be used by Alertmanager.
Let’s start with a simple one – alert when the server is down.
- alert: Node down
expr: up{job="node_exporter"} == 0
for: 3m
labels:
severity: warning
annotations:
title: Node {{ $labels.instance }} is down
description: Failed to scrape {{ $labels.job }} on {{ $labels.instance }} for more than 3 minutes. Node seems down.
The essence of this alert is expression which states up{job="node_exporter"} == 0
. I’ve seen a lot of examples that just use up == 0
but it’s strange because
every exporter that is being scraped by Prometheus has this metric, so you’ll be
alerted on a completely unwanted thing like restart of postgres_exporter which
is not the same as Postgres itself. So I set the job label to node_exporter to explicitly alert on node health.
Another key part in this alert is the for: 3m
which tells Prometheus to send
alert only when expression holds true for 3 minutes. This is intended to avoid
false positives when some scrapes failed because of network hiccups. It basically adds robustness to your alerts.
Some people use blackbox_exporter with ICMP probe for this.
Next is the Linux md raid alert
- alert: MDRAID degraded
expr: (node_md_disks - node_md_disks_active) != 0
for: 1m
labels:
severity: warning
annotations:
title: MDRAID on node {{ $labels.instance }} is in degrade mode
description: "Degraded RAID array {{ $labels.device }} on {{ $labels.instance }}: {{ $value }} disks failed"
In this one I check the diff between the total count of the disks and count of
the active disks and use diff value {{ $value }}
in description.
You can also access metric labels via $labels
variable to put useful info into
your alerts.
The next one is for bonding status:
- alert: Bond degraded
expr: (node_bonding_active - node_bonding_slaves) != 0
for: 1m
labels:
severity: warning
annotations:
title: Bond is degraded on {{ $labels.instance }}
description: Bond {{ $labels.master }} is degraded on {{ $labels.instance }}
This one is similar to mdraid one.
And the final one for hardware alerts is free space:
- alert: Low free space
expr: (node_filesystem_free{mountpoint !~ "/mnt.*"} / node_filesystem_size{mountpoint !~ "/mnt.*"} * 100) < 15
for: 1m
labels:
severity: warning
annotations:
title: Low free space on {{ $labels.instance }}
description: On {{ $labels.instance }} device {{ $labels.device }} mounted on {{ $labels.mountpoint }} has low free space of {{ $value }}%
To calculate free space I compute it as a percentage and check if it's less than 15%. In the expression above I'm also excluding all mountpoints under /mnt because those are usually external to the node, like remote storage which may be close to full, e.g. for backups.
The final note here is labels, where I set severity: warning. Inspired by the Google SRE book, I have decided to use only 2 severity levels for alerting – warning and page. warning alerts should go to the ticketing system and you should react to them during normal working days. page alerts are emergencies and can wake up the on-call engineer – this type of alert should be crafted carefully to avoid burnout. Alert routing based on severity levels is managed by Alertmanager.
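The Alertmanager side of that routing is not part of this post, but a sketch of it could look like this (receiver names and notification configs are assumptions):
route:
  receiver: tickets            # default route - warning alerts go to the ticketing system
  routes:
    - match:
        severity: page
      receiver: oncall         # page alerts go to the on-call engineer
receivers:
  - name: tickets
    # email_configs / webhook_configs for your ticketing system go here
  - name: oncall
    # e.g. pagerduty_configs go here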
These are pretty simple – we have a warning
alert on redis cluster instance
availability and page
alert when the whole cluster is broken:
- alert: Redis instance is down
expr: redis_up == 0
for: 1m
labels:
severity: warning
annotations:
title: Redis instance is down
description: Redis is down at {{ $labels.instance }} for 1 minute.
- alert: Redis cluster is down
expr: min(redis_cluster_state) == 0
labels:
severity: page
annotations:
title: Redis cluster is down
description: Redis cluster is down.
These metrics are reported by redis_exporter. I deploy it on all instances of
Redis cluster – that’s why there is a min
function applied on
redis_cluster_state
.
I have a single Redis cluster but if you have multiple you should include that into alert description – possibly via labels.
For Kafka we check for availability of brokers and health of the cluster.
- alert: KafkaDown
expr: up{instance=~"kafka-.+", job="jmx-exporter"} == 0
for: 3m
labels:
severity: warning
annotations:
title: Kafka broker is down
description: Kafka broker is down on {{ $labels.instance }}. Could not scrape jmx-exporter for 3m.
To check whether Kafka is down we check the up metric from jmx-exporter. This is a sane way of checking whether the Kafka process is alive because jmx-exporter runs as a Java agent inside the Kafka process. We also filter by instance name because jmx-exporter is run for both Kafka and Zookeeper.
- alert: KafkaNoController
expr: sum(kafka_controller_kafkacontroller_activecontrollercount) < 1
for: 3m
labels:
severity: warning
annotations:
title: Kafka cluster has no controller
description: Kafka controller count < 1, cluster is probably broken.
This one checks for the active controller. The controller is responsible for
managing the states of partitions and replicas and for performing administrative
tasks like reassigning partitions. Every broker reports
kafka_controller_kafkacontroller_activecontrollercount
metric but only current
controller will report 1 – that’s why we use sum
.
If you use Kafka as an event bus or for any other real time processing you may
choose severity page
for this one. In my case, I use it as a queue and if it’s
broken client requests are not affected. That’s why I have severity warning
here.
- alert: KafkaOfflinePartitions
expr: sum(kafka_controller_kafkacontroller_offlinepartitionscount) > 0
for: 3m
labels:
severity: warning
annotations:
title: Kafka cluster has offline partitions
description: "{{ $value }} partitions in Kafka went offline (have no leader), cluster is probably broken.
In this one we check for offline partitions. These partitions have no leader and
thus can’t accept or deliver messages. We check for offline partitions on all
nodes – that’s why we have sum
in alert expression.
Again, if you use Kafka for some real-time processing you may choose to assign
page
severity for these alerts.
- alert: KafkaUnderreplicatedPartitions
expr: sum(kafka_cluster_partition_underreplicated) > 10
for: 3m
labels:
severity: warning
annotations:
title: Kafka cluster has underreplicated partitions
description: "{{ $value }} partitions in Kafka are under replicated
Finally, we check for under-replicated partitions. This may happen when some Kafka node fails and a partition has nowhere to replicate. This does not prevent Kafka from serving this partition – producers and consumers will continue to work – but the data in this partition is at risk.
Zookeeper alerts are similar to Kafka – we check for instance availability and cluster health.
- alert: Zookeeper is down
expr: up{instance=~"zookeeper-.+", job="jmx-exporter"} == 0
for: 3m
labels:
severity: warning
annotations:
title: Zookeeper instance is down
description: Zookeeper is down on {{ $labels.instance }}. Could not scrape jmx-exporter for 3 minutes.
Just like with Kafka, we check for Zookeeper instance availability using the up metric of jmx-exporter, because it runs inside the Zookeeper process.
- alert: Zookeeper is slow
expr: max_over_time(zookeeper_MaxRequestLatency[1m]) > 10000
for: 3m
labels:
severity: warning
annotations:
title: Zookeeper high latency
description: Zookeeper latency is {{ $value }}ms (aggregated over 1m) on {{ $labels.instance }}.
You should really care about Zookeeper performance in terms of latency because if it’s slow dependent systems will fall miserably – leader election will fail, replication will fail and all other sorts of bad things will happen.
Zookeeper latency is reported via the zookeeper_MaxRequestLatency metric, but it's a gauge so you can't apply the increase or rate functions to it. That's why we use max_over_time looking at 1m intervals.
The alert is checking whether max latency is more than 10 seconds (10000ms). This may seem extreme but we saw it in production.
- alert: Zookeeper ensemble is broken
expr: sum(up{job="jmx-exporter", instance=~"zookeeper-.+"}) < 2
for: 1m
labels:
severity: page
annotations:
title: Zookeeper ensemble is broken
description: Zookeeper ensemble is broken, it has {{ $value }} nodes in it.
Finally, there is an alert for the Zookeeper ensemble status where we sum the up metric values for jmx-exporter. Remember that it runs inside the Zookeeper JVM, so essentially we check how many Zookeeper instances are up and compare that to the majority of our cluster (2 in the case of a 3-node cluster).
Similar to Zookeeper and any other cluster system we check for Consul availability and cluster health.
There are 2 metrics sources for Consul – 1) The official consul_exporter and 2) the Consul itself via telemetry configuration.
consul_exporter provides most of the metrics for monitoring health of nodes and services registered in Consul. And Consul itself exposes internal metrics like client RPC RPS rate and other runtime metrics.
To check whether a Consul agent is healthy we use the consul_health_node_status metric with the label status="critical":
- alert: Consul agent is not healthy
expr: consul_health_node_status{instance=~"consul-.+", status="critical"} == 1
for: 1m
labels:
severity: warning
annotations:
title: Consul agent is down
description: Consul agent is not healthy on {{ $labels.node }}.
Next, we check for cluster degrade via consul_raft_peers
. This metric reports
how many server nodes are in the cluster. The trick is to apply min
function
to it so we can detect network partitions where one instance thinks that it has
2 raft peers and the other has 1.
- alert: Consul cluster is degraded
expr: min(consul_raft_peers) < 3
for: 1m
labels:
severity: page
annotations:
title: Consul cluster is degraded
description: Consul cluster has {{ $value }} servers alive. This may lead to cluster break.
Finally, we check the autopilot status. Autopilot is a Consul feature where the leader constantly checks the stability of the other servers. This is an internal metric reported by Consul itself.
- alert: Consul cluster is not healthy
expr: consul_autopilot_healthy == 0
for: 1m
labels:
severity: page
annotations:
title: Consul cluster is not healthy
description: Consul autopilot thinks that cluster is not healthy.
I hope you’ll find this useful and these sample alerts will help you jump start your Prometheus journey.
There are a lot of useful metrics you can use for alerts and there is no magic here – research what metrics you have, think how it may help to track the stability of your system, rinse and repeat.
That’s it, till the next time!
This is where SSH access to instances for Ansible is needed. There are 2 ways that this could be accomplished - 1) Add SSH key to the project metadata 2) Use OS Login feature. As you can guess I'm using OS Login. You can read about OS Login and its benefits in docs. Here I'll show you how to make Ansible work via OS Login.
In the end, we’ll have a service account for Ansible that will be able to SSH connect to instances via OS login.
In short, OS Login allows SSH access for IAM users - there is no need to provision Linux users on an instance.
So Ansible should have access to the instances via IAM user. This is accomplished via IAM service account.
You can create service account via Console (web UI), via Terraform template or (as in my case) via gcloud:
$ gcloud iam service-accounts create ansible-sa \
--display-name "Service account for Ansible"
Now, the trickiest part – configuring OS Login for service account. Before you do anything else make sure to enable it for your project:
$ gcloud compute project-info add-metadata \
--metadata enable-oslogin=TRUE
A fresh service account doesn't have any IAM roles, so it doesn't have permission to do anything. To allow OS Login we have to add 4 roles to the Ansible service account. Here is how to do it via gcloud:
for role in \
'roles/compute.instanceAdmin' \
'roles/compute.instanceAdmin.v1' \
'roles/compute.osAdminLogin' \
'roles/iam.serviceAccountUser'
do \
gcloud projects add-iam-policy-binding \
my-gcp-project-241123 \
--member='serviceAccount:ansible-sa@my-gcp-project-241123.iam.gserviceaccount.com' \
--role="${role}"
done
Service account is useless without key, create one with gcloud:
$ gcloud iam service-accounts keys create \
.gcp/gcp-key-ansible-sa.json \
--iam-account=ansible-sa@my-gcp-project.iam.gserviceaccount.com
This will create GCP key, not the SSH key. This key is used for interacting with Google Cloud API – tools like gcloud, gsutil and others are using it. We will need this key for gcloud to add SSH key to the service account.
This is the easiest part)
$ ssh-keygen -f ssh-key-ansible-sa
Now, to allow service account to access instances via SSH it has to have SSH key added to it. To do this, first, we have to activate service account in gcloud:
$ gcloud auth activate-service-account \
--key-file=.gcp/gcp-key-ansible-sa.json
This command uses GCP key we’ve created on step 2.
Now we add SSH key to the service account:
$ gcloud compute os-login ssh-keys add \
--key-file=ssh-key-ansible-sa.pub
$ gcloud config set account your@gmail.com
Now, we have everything configured on the GCP side, we can check that it’s working.
Note, that you don’t need to add SSH key to compute metadata, authentication works via OS login. But this means that you need to know a special user name for the service account.
Find out the service account id.
$ gcloud iam service-accounts describe \
ansible-sa@my-gcp-project.iam.gserviceaccount.com \
--format='value(uniqueId)'
106627723496398399336
This id is used to form user name in OS login – it’s sa_<unique_id>
.
Here is how to use it to check SSH access is working:
$ ssh -i .ssh/ssh-key-ansible-sa sa_106627723496398399336@10.0.0.44
...
sa_106627723496398399336@instance-1:~$ # Yay!
And for the final part – make Ansible work with it.
There is a special variable ansible_user
that sets user name for SSH when
Ansible connects to the host.
In my case, I have a group gcp
where all GCP instances are added, and so I can
set ansible_user
in group_vars like this:
# File inventory/dev/group_vars/gcp
ansible_user: sa_106627723496398399336
And check it:
$ ansible -i inventory/dev gcp -m ping
10.0.0.44 | SUCCESS => {
"changed": false,
"ping": "pong"
}
10.0.0.43 | SUCCESS => {
"changed": false,
"ping": "pong"
}
And now we have Ansible configured to access GCP instances via OS Login. There is no magic here – just a bit of gluing together a bunch of stuff after reading lots of docs. That’s it for now, till the next time!
db, err := sqlx.Connect("postgres", DSN)
if err != nil {
return nil, errors.Wrap(err, "failed to connect to db")
}
Nice and familiar but why fail immediately? We can certainly do better!
We can just wait a little bit for a database in a loop because databases may come up later than our service. Connections are usually done during initialization so we almost certainly can wait for them.
Here is how I do it:
package db
import (
"fmt"
"log"
"time"
"github.com/jmoiron/sqlx"
"github.com/pkg/errors"
)
// ConnectLoop tries to connect to the DB under the given DSN using the given driver
// in a loop until connection succeeds. timeout specifies the timeout for the
// loop.
func ConnectLoop(driver, DSN string, timeout time.Duration) (*sqlx.DB, error) {
ticker := time.NewTicker(1 * time.Second)
defer ticker.Stop()
timeoutExceeded := time.After(timeout)
for {
select {
case <-timeoutExceeded:
return nil, fmt.Errorf("db connection failed after %s timeout", timeout)
case <-ticker.C:
db, err := sqlx.Connect(driver, DSN)
if err == nil {
return db, nil
}
log.Println(errors.Wrapf(err, "failed to connect to db %s", DSN))
}
}
}
Our previous code is now wrapped with a ticker loop. Ticker is basically a channel that delivers a tick on a given interval. It’s a better pattern than using for and sleep.
On each tick, we try to connect to the database. Note that I'm using sqlx here because it provides a convenient Connect method that opens a connection and pings the database.
There is a timeout to avoid infinite connect loop. Timeout is delivered via channel and that’s why there is a select here – to read from 2 channels.
Quick gotcha – initially I was doing the first case like this mimicking the
example in time.After
docs:
// XXX: THIS DOESN'T WORK
for {
select {
case <-time.After(timeout):
return nil, fmt.Errorf("db connection failed after %s timeout", timeout)
case <-ticker.C:
...
}
}
but my timeout was never exceeded. That's because we have a loop, and so
time.After
creates a new channel on each iteration, effectively resetting the
timeout.
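The fix is to create the timeout channel once, before the loop – exactly what ConnectLoop above does with timeoutExceeded. A condensed sketch:
timeoutExceeded := time.After(timeout)
for {
	select {
	case <-timeoutExceeded:
		return nil, fmt.Errorf("db connection failed after %s timeout", timeout)
	case <-ticker.C:
		// try to connect as before
	}
}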
So this simple trick will make your code more robust without sacrificing readability – this is what my diff for the new function looks like:
// New creates new Article service backed by Postgres
func NewService(DSN string) (*Service, error) {
- db, err := sqlx.Connect("postgres", DSN)
+ db, err := db.ConnectLoop("postgres", DSN, 5*time.Minute)
if err != nil {
return nil, errors.Wrap(err, "failed to connect to articles db")
}
There is no magic here, just a simple code. Hope you find this useful. Till the next time!
]]>It came to a point where I switched to Visual Studio Code because I wanted a more integrated experience. And I quite liked it! Mainly it's because its Vim emulation is the best across all the editors including Atom, Sublime and JetBrains products. This is very important to me because I strongly believe that the Vim editing language is superior to anything else.
So I used VS Code with Vim mode (of course) for a while, but from time to time I missed some Vim features like flexible splits.
And so I decided to revamp my Vim setup. But this time I made it differently.
I introspected my workflow and tuned Vim to the way I work. Not the other way around where you change your habits to work around editor setup. And I encourage you to do this yourself regardless of your editor.
Disclaimer: My setup may seem wrong to you but that’s because it’s tailored to my needs. Don’t blindly copy-paste my config – read the help, think and make it yours.
Here is the quick outline of what I did:
Let's do this one quick – I use Neovim. I think it's the best thing that happened to the Vim community in the last decade. I like the project philosophy and that it rattled up Vim – now Vim 8.0 has adopted ideas from Neovim like async job control and terminal.
To install Neovim I recommend using AppImage. You just download the single file and run it. No libs, no containers, nothing. It also allows me to run the latest version hassle-free. I'd never used AppImage before and thought that it would be distributed as some kind of container image, but it's actually a good old binary:
$ file nvim.appimage
nvim.appimage: ELF 64-bit LSB executable, x86-64, version 1 (SYSV), dynamically linked, interpreter /lib64/ld-linux-x86-64.so.2, for GNU/Linux 2.6.18, stripped
After installing Neovim you should really run :checkhealth
and fix top issues
– install the clipboard and python provider.
Next, read the help for Neovim setup – :h nvim-from-vim
. I’m doing it simple,
just put this
set runtimepath^=~/.vim runtimepath+=~/.vim/after
let &packpath = &runtimepath
source ~/.vimrc
to the .config/nvim/init.vim
and use the ~/.vimrc
for the configuration.
After that, let’s start digging into it.
What this gives you is the latest version of Neovim that doesn't conflict with anything and is compatible with Vim.
IMO, Vim help is the most underestimated feature of Vim. I hadn't used it until this revamp and, boy, what I've missed! So much useless searching, reading silly blogs and StackOverflow could have been avoided if I had used the help system.
Vim help consists of 3.7 megabytes of text – half a million words:
$ wc neovim-0.3.4/runtime/doc/* | tail -n1
90804 543942 3592651 total
Also, almost every plugin you install has its own help so these numbers are not final.
Vim help topics are comprehensive, detailed and cross-referenced. You may be overwhelmed at first because there is a lot of information here. But don’t be discouraged – it’s much much more efficient and useful to read and grasp comprehensive help topic than mindlessly searching for blog posts or StackOverflow. If you could only learn one thing from this post – please, learn to love the Vim help system.
Some tips that helped me:
:h patt then TAB to find help on the subject starting with patt
:h patt then Ctrl-D to find help on the subject containing patt
which is help on help!
Let’s look at the example, if you type :h word-m
Vim will open help on word
motions:
==============================================================================
4. Word motions *word-motions*
<S-Right> or *<S-Right>* *w*
w [count] words forward. |exclusive| motion.
<C-Right> or *<C-Right>* *W*
W [count] WORDS forward. |exclusive| motion.
...
Here you can see the header Word motions
, its tag word-motions
that is used
as a subject for :h
command.
Next, you see the help itself describing word motions.
Note that there are some words that have some funky symbols around them or shown
in different colors. Anything that doesn’t look like the plain text is a help
topic by itself – you can jump into it by Ctrl-]
. So in this example, we could
find what is [count]
or what is |exclusive|
motion. And that’s enough for
efficient use of Vim help.
Here are the things that I've found in Vim help:
:h statusline. All the blog posts were just a waste of time.
:h ins-completion describes the comprehensive builtin completion system. Now, I'm using Ctrl-X Ctrl-F to complete filenames in the current directory (useful to insert links in Markdown files). Also, whole line completion with Ctrl-X Ctrl-L is useful for editing data files.
:h window-moving taught me that you can move splits around, e.g. Ctrl-w H will move the current window to the left (it will also convert a vertical split to a horizontal one). Also, the whole :h windows.txt is amazing.
Finally, I recommend to everyone familiar with Vim to review :h quickref from time to time.
After I learned to use Vim help I started to discover things that I had missed but that were always there.
Remember to check the help for each thing in this list – I’ve conveniently supplied Vim help command and a link to online help.
Auto commands allow you to tune Vim behavior based on filename or filetype. Basically, it executes Vim commands on events.
I use it to set correct filetype for some exotic files like this
autocmd BufRead,BufNewFile *.pp setfiletype ruby
autocmd BufRead,BufNewFile alert.rules setfiletype yaml
Or to tune settings for particular filetype like this
autocmd FileType yaml set tabstop=2 shiftwidth=2
Other editors required me to install full-blown extensions like Puppet extension or YAML extension but with Vim I keep things simple and lightweight.
This feature is so awesome yet none of the other editors have it.
It sounds simple – when you exit Vim your edit history is saved so you can open the file again 2 days later and undo the changes.
Edit history is an important part of your context, so I think once you get used to it you can't use any other editor without this feature.
To enable persistent undo I’ve done this:
set undodir=~/.vim/undodir
set undofile
Bliss!
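One caveat: Vim won't create this directory for you automatically, so make sure it exists, for example:
$ mkdir -p ~/.vim/undodir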
This one is actually more of a hard fix than a feature.
Clipboard in Linux is a complicated story. All these buffers and selections don’t make things understandable. And Vim makes it even more complicated with its registers.
For years I had these mappings
" C-c and C-v - Copy/Paste to global clipboard
vmap <C-c> "+yi
imap <C-v> <esc>"+gpi
that make Ctrl-c and Ctrl-v work.
But why use two-key combos when you can use a simple y
and p
for copying and
pasting?
Turns out, you can make it work very nicely by using this single setting:
set clipboard+=unnamed
It makes y
and p
copy and paste to the “global” buffer that is used by other
apps like the browser.
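Note that on Linux the unnamed register maps to the primary selection (middle-click paste). If you want y and p to use the clipboard that regular Ctrl-C/Ctrl-V apps see, you may need the plus register instead – see :h 'clipboard':
set clipboard+=unnamedplus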
What I like the most about Vim is that its normal mode allows you to use any key for a command, while other editors require some key combo based on a modifier (Ctrl-o, Ctrl-s).
When you can use any key for a command it's natural to use single-key
shortcuts, e.g. p
to paste the text.
And what is even more awesome is that you can map a key or a sequence of keys at your own will.
Here are my most used mappings:
nnoremap ; :Buffers<CR>
nnoremap f :Files<CR>
nnoremap T :Tags<CR>
nnoremap t :BTags<CR>
nnoremap s :Ag<CR>
NOTE: these mappings override default Vim motions and actions because I don’t use them. It may be better for you to map it via leader key. Anyway, read the help on what these letters do by default and decide whether you want to override them.
These mappings invoke fzf
command (more on this later) using a single
key.
If I need to go to some function I just press t
and get the list of tags of
the current file. Not Ctrl-t
, not Shift-t
, just t
. Combined with fzf
fuzzy find it’s very powerful.
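If you'd rather keep the default keys intact, the same idea works with leader-based mappings – a sketch (pick whatever leader and keys suit you):
" Leader-based alternatives to the single-key fzf mappings above
let mapleader = ' '
nnoremap <leader>b :Buffers<CR>
nnoremap <leader>f :Files<CR>
nnoremap <leader>s :Ag<CR>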
For years I'd been using Vim in a terminal without knowing that I was using an 8-bit colorscheme. And it was actually OK because 256 colors is kinda enough.
It’s worth noting that I’m using my own colorscheme called tile. While tuning some of the colors I didn’t understand why I don’t see the difference and then I’ve read the help on syntax highlighting and realized that I want true colors in Vim.
Also, most of the colorschemes that you see in the wild, e.g. on https://vimcolors.com/ are presented in the 24-bit colors. So you’ll be disappointed when you don’t see the same colors when you install the colorscheme in your Vim.
Also also, your terminal is almost certainly capable of displaying in True Color so why limit yourself to the 256?
It all boils down to the simple set termguicolors
in your vimrc. This
option simply enables true color for Vim. Here is the difference with my
colorscheme:
The last one is quick but so great that I even tweeted about it:
All of the things above already boosted my productivity but Vim can do even better when you know what you want.
In my case, here was the list:
fzf
ag (the_silver_searcher)
So let's dive in.
For me working with projects is about saving context – Open files, layout, cursor positions, settings, etc.
Vim has sessions (:help session
) that does all that.
To save a session you have to :mksession!
(or short :mks!
) and then to load
session start it with vim -S Session.vim
. It may be enough for you but I found
it kinda cumbersome to use as is.
First thing I’ve tried was to automate saving session. I’ve tried nice and
simple obsession plugin that does just
that. For the loading part, I’ve created bash alias alias vims='vim -S Session.vim'
.
This was OK but a few things were annoying. The way I work is like this: I have
multiple projects that are kept in separate directories as separate git repos.
If I want to do something I cd
into that dir, open the file, edit it or just
view, and then do something else.
When I was opening a file with Vim inside a directory, the session wasn't applied, so
I had to manually :source
it. After doing this for a week it was obvious that
it wasn't the way I wanted to work.
And then I’ve found an amazing vim-workspace plugin that does
exactly what I need. It creates a session when you :ToggleWorkspace
and keeps
it updated. Then when you open any file in the workspace it automatically loads
the session.
It also has very nice command :CloseHiddenBuffers
that, well, closes hidden
buffers. It’s very useful because during session lifetime you open files and Vim
keeps them open. With this single command you can leave only the current buffer.
So I settled on the vim-workspace and found peace.
Since the last time I did a Vim configuration, which was around 2008, a lot of things have changed. But the area that exploded the most, from my point of view, was autocompletion support in Vim.
Vim gained a sophisticated completion engine (:h ins-completion
)
with omni-completion that gave birth to a whole load of plugins:
YouCompleteMe, OmniCppComplete,
neocomplcache/neocomplete/deoplete, AutoComplPop, clang_complete, …
It is complicated and I was exhausted while researching on this topic, so here is the shortest possible guide on completion plugins:
My choice is deoplete because it’s fast, versatile, and not heavy. If you want to keep things native, then I’d recommend using VimCompletesMe. I’ve tried to use YouCompleteMe, had some troubles with installation, gave it 250 MB and it just showed me the function names without signatures and argument names. So I was disappointed and switched to deoplete that provides more info.
For the Deoplete I’ve added a few completion sources:
There is also tmux-complete that can complete from other tmux panes. For example, you can view logs in one pane and Vim in the other pane can complete values from it! It works but I don't use tmux much.
There is also webcomplete completion source that completes from the currently open web page in Chrome. Alas, it works only on macos. There is an open discussion about adding support for Chrome on Linux.
The ability to quickly open file is crucial to my productivity. And I need to
open a file by partial name. As an example, suppose I’m working in some ansible
repo. I know that I have a template file for setting environment vars. I don’t
remember exactly the full path but I know that it has env
in it.
So I use fzf
to sift through the list of files in the project that is generated
by ag -l
. Here is how it works live:
There are other plugins that do that like
CtrlP but I use fzf
for other things
– list of buffers (open files), search, git commits, list of tags, history of
search and history of command. Anything that should be sifted through is piped
to the fzf
because it does this job really well.
File find is launched with a single letter command f
in the normal mode.
Before this revamp I’ve used builtin /
Vim command to search in the current
buffer and :Ag
to search in the files. I really like ag
– it’s fast and
very handy.
After I’ve embarked on the fzf
I hooked Ag output to it and now it works even
better:
File search is launched with a single letter command s
in the normal mode.
This was my long wished dream – when I stumble on some function I want to see its callers. Sounds simple but it’s a difficult task. The only thing that can do it and that is not tied to an IDE is cscope.
But cscope is, how to put it nicely, a weird thing. It requires you to build its own database by supplying a list of files and then provides a TUI interface to interact with. Its documentation doesn't help much and it feels like nobody uses it.
This idiosyncratic cscope workflow was the main reason why I occasionally opted for other editors and IDEs. Just to see if they have “find usages” implemented well.
But this time I said to myself – you have to make it work. And here is what I did.
First, I started with automatically generating cscope database. I use vim-gutentags for this – it generates ctags index and cscope database on file save.
Then to integrate cscope I’ve tried different things:
" cscope
function! Cscope(option, query)
let color = '{ x = $1; $1 = ""; z = $3; $3 = ""; printf "\033[34m%s\033[0m:\033[31m%s\033[0m\011\033[37m%s\033[0m\n", x,z,$0; }'
let opts = {
\ 'source': "cscope -dL" . a:option . " " . a:query . " | awk '" . color . "'",
\ 'options': ['--ansi', '--prompt', '> ',
\ '--multi', '--bind', 'alt-a:select-all,alt-d:deselect-all',
\ '--color', 'fg:188,fg+:222,bg+:#3a3a3a,hl+:104'],
\ 'down': '40%'
\ }
function! opts.sink(lines)
let data = split(a:lines)
let file = split(data[0], ":")
execute 'e ' . '+' . file[1] . ' ' . file[0]
endfunction
call fzf#run(opts)
endfunction
" Invoke command. 'g' is for call graph, kinda.
nnoremap <silent> <Leader>g :call Cscope('3', expand('<cword>'))<CR>
What it does is call cscope and feed its output to fzf. '3'
is the field
number in the cscope TUI interface (yeah, you read that correctly, :facepalm:)
corresponding to Find functions calling this function
.
This thing works – I pasted it to my vimrc and invoke it via <Leader>g
but it
needs to be packaged as a plugin. Maybe I’ll do this sometime.
Overall cscope feels like fucking dirt but we don’t have anything better.
I've got used to the console interface of git because it's stable, independent of any editor and provides all the features of git since it's the main interface. And I'm very comfortable with this way of working with git.
So my requirements for Git integration were pretty small – actually, I just wanted to explore how this integration could help my workflow.
First, I tried fugitive but quickly found that it was not for me. It was not suitable for my workflow. The main problem is that it messes up my window layout by opening its own buffers with git output:
:Gstatus
I want to see the changes, so I invoke :Gdiff. It opens the diff in the closest window, replacing the buffer I was editing. That's OK, but when I'm done with the diff I want to close it and return to the previous buffer. And this is where it gets complicated – the diff is 2 windows, so I have to return with Ctrl-o to the previous buffer in one window and then kill the other buffer with :bd. This is really not convenient.
:Glog just spits git log output into messages.
:Gblame shows the standard git blame output and that's OK. When I try to view a commit from blame it opens it in the current window, again messing with my layout, and scrolls the commit to the diff of the chosen lines. This is not what I want – I want to view the commit message and other related changes. The scrolled part is what I already saw when I was doing blame.
So I've ditched it and settled on vim-gitgutter because it's nice and doesn't interfere with my workflow. This plugin shows line status in the gutter. And it provides a motion for next/previous hunk.
Then I’ve tried to use vimagit and it’s great! This is what I really want for Git integration – a convenient staging of changes and writing commit message. Vimagit gives me a buffer with unstaged and staged diffs and a commit message section and simple to use mappings. Really great!
Finally, I’ve found git-messenger that shows blame info (with history) in the floating window.
Similar to Git this wasn’t a hard requirement because I’m doing building and linting from the shell or automatically in CI. But, again, I wanted to explore what could be done here.
I set up Neomake as a linting engine. It has a pre-configured list
of linters depending on filetype. I've configured it to run only on buffer
write (it can be launched at an interval, on reading, etc.) to avoid useless
work. The count of warnings and errors of a Neomake run is shown in the
statusline (see screenshot below). And the results of linting can be viewed in
the location list – :lopen,
:lnext,
:lprev.
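For reference, the run-on-write behavior is configured with something like this (a sketch – check Neomake's help for the exact automake modes):
" Run Neomake automatically, but only when a buffer is written
call neomake#configure#automake('w')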
Also, Neomake can invoke make program (:help makeprg
)
without blocking the UI so I’ve added this mapping and that’s it:
nnoremap <leader>m :Neomake!<cr>
The results of build are in the QuickFix list (:help quickfix
).
This plugin is a godsend for me. I use splits a lot and
sometimes I want to temporary zoom the current window. With this plugin, I just
do <Ctrl-w>z
to toggle the zoom. This is similar to the tmux
zoom feature.
vim-sensible provides sensible defaults like enabling filetype, autoread, statusline. But most important for me was this line
set formatoptions+=j " Delete comment character when joining commented lines
Commentary plugin adds actions to quickly comment line, selection or pretty much any motion.
Surround plugin allows me to easily add, change or delete “surroundings”. For
example, I often use it to add quotes to the word with ysw"
(I have a
mapping for that) and change single quotes to double quotes
with cs'"
.
So here I am, happily living with Vim for about 3 months now. I intentionally waited to post this to prove to myself that my new setup is worth it. And, gosh, it is!
The main boost was getting comfortable with reading Vim help. Yes, I’m trying again to convince you about reading it because it makes you reason about what you do correctly.
And the key point is to tune Vim into your workflow, not the other way around.
Also, I’m tweaking things as I keep finding new ways to make my life in the
editor more pleasant. The recent one was set hidden
(:h hidden
) to
prevent nagging 'No write since last change'
message when switching buffers.
There is no magic here in Vim when you put some conscientious effort and try to do things your way.
That’s it for now, till the next time!
]]>Just in case you've never heard about it – Envoy is a proxy server that is most commonly used in a service mesh scenario, but it can also be an edge proxy.
In this post, I will look only at the edge proxy scenario because I've never maintained a service mesh. Keep that use case in mind. Also, I will inevitably compare Envoy to nginx because that's what I know and use.
The main reason why I wanted to try Envoy was its several compelling features:
Let’s unpack that list!
Observability is one of the most thorough features in Envoy. One of its design principles is to provide transparency in network communication, given how complex modern systems are built with all this microservices madness.
Out of the box it provides lots of metrics for various metrics systems including Prometheus.
To get that kind of insight in nginx you have to buy nginx plus or use the VTS module, thus compiling nginx on your own. Hopefully, my project nginx-vts-build will help – I'm building nginx with the VTS module as a drop-in replacement for stock nginx with a systemd service and basic configs. Think about it as an nginx distro. Currently, it has only one release for Debian 9 but I'm open to suggestions. If you have a feature request, please let me know. But let's get back to Envoy.
In addition to metrics, Envoy can be integrated with distributed tracing systems like Jaeger.
And finally, it can capture the traffic for further analysis with wireshark.
I’ve only looked at Prometheus metrics and they are quite nice!
Load balancing in Envoy is very feature-rich. Not only does it support round-robin, weighted and random policies, but also load balancing using consistent hashing algorithms like ketama and maglev. The point of the latter is fewer changes in traffic patterns in case of rebalancing in the upstream cluster.
Again, you can get the same advanced features in nginx but only if you pay for nginx plus.
To check the health of the upstream endpoints, Envoy will actively send requests and expect a valid answer for an endpoint to remain in the upstream cluster. This is a very nice feature that open source nginx lacks (but nginx plus has).
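For illustration, an active HTTP health check on a cluster looks roughly like this (a sketch in the config format used later in this post; the /healthz path is an assumption and field names may differ between Envoy versions):
clusters:
- name: backend
  type: STATIC
  connect_timeout: 1s
  health_checks:
  - timeout: 1s
    interval: 5s
    unhealthy_threshold: 3
    healthy_threshold: 2
    http_health_check:
      path: /healthz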
You can configure Envoy as a Redis proxy, DynamoDB filter, MongoDB filter, grpc proxy, MySQL filter, Thrift filter.
This is not a killer feature, imho, given that most of these protocols support is experimental but anyway it’s nice to have and shows that Envoy is extensible.
It also supports Lua scripting out of the box. For nginx you have to use OpenResty.
The features above alone make a very good reason to use Envoy. However, I found a few things that keep me from switching to Envoy from nginx:
Envoy doesn't support caching of responses. This is a must-have feature for an edge proxy and nginx implements it really well.
While Envoy does networking really well, it doesn’t access filesystem apart from initial config file loading and runtime configuration handling. If you thought about serving static files like frontend things (js, html, css) then you’re out of luck - Envoy doesn’t support that. Nginx, again, does it very well.
Envoy is configured via YAML and for me its configuration feels very explicit
though I think it’s actually a good thing – explicit is better than implicit.
But I feel that Envoy configuration is bounded by features specifically
implemented in Envoy. Maybe it’s a lack of experience with Envoy and old
habits but I feel that in nginx with maps, rewrite module (with if
directive)
and other nice modules I have a very flexible config system that allows me to
implement anything. The cost of this flexibility is, of course, a good portion
of complexity – nginx configuration requires some learning and practice but in
my opinion it’s worth it.
Nevertheless, Envoy supports dynamic configuration, though it’s not like you can change some configuration part via REST call, it’s about the discovery of configuration settings – that’s what the whole XDS protocol is all about with its EDS, CDS, RDS and what-not-DS.
Citing docs:
Envoy discovers its various dynamic resources via the filesystem or by querying one or more management servers.
Emphasis is mine – I wanted to note that you have to provide a server that will respond to the Envoy discovery (XDS) requests.
However, there is no ready-made solution that implements Envoys’ XDS protocol. There was a rotor but the company behind it shut down so the project is mostly dead.
There is an Istio but it’s a monster I don’t want to touch right now. Also, if you’re on Kubernetes then there is a Heptio Contour, but not everybody needs and uses Kubernetes.
In the end, you could implement your own XDS service using go-control-plane stubs.
But that doesn't seem to be widely used. What I saw most people do is use DNS for
EDS and CDS. Especially remembering that Consul has a DNS interface, it seems
that we can use Consul to dynamically provide the list of hosts to Envoy.
This isn’t big news because I can (and do) use Consul to provide the list of
backends for nginx by using DNS name in proxy_pass
and resolver
directive.
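For the curious, that nginx + Consul setup looks roughly like this (a sketch – the service name and Consul DNS address are assumptions for your environment):
server {
    listen 8000;

    # Consul agents answer DNS on port 8600 by default; re-resolve every 10s
    resolver 127.0.0.1:8600 valid=10s;

    location / {
        # Using a variable forces nginx to re-resolve the name at request time
        set $backend "backend.service.consul";
        proxy_pass http://$backend:10000;
    }
}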
Also, Consul Connect supports Envoy for proxying requests, but this is not about Envoy – this is about how awesome Consul is!
So this whole dynamic configuration thing of Envoy is really confusing and hard to follow because whenever you try to google it you’ll get bombarded with posts about Istio which is distracting.
This is a minor thing but it just annoys me. Also, I don’t like that Docker images don’t have tags with versions. Maybe it’s intended so you always run the latest version but it seems very strange.
In the end, I’m not saying Envoy is bad in any way – from my point of view it just has a different focus on advanced proxying and out of process service mesh data plane. The edge proxy part is just a bonus that is suitable in some but not many situations.
With that being said let’s see Envoy in practice and repeat mirroring experiments from my previous post.
Here are 2 minimal configs – one for nginx and the other Envoy. Both doing the same – simply proxying requests to some backend service.
# nginx proxy config
upstream backend {
server backend.local:10000;
}
server {
server_name proxy.local;
listen 8000;
location / {
proxy_pass http://backend;
}
}
# Envoy proxy config
static_resources:
listeners:
- name: listener_0
address:
socket_address:
protocol: TCP
address: 0.0.0.0
port_value: 8001
filter_chains:
- filters:
- name: envoy.http_connection_manager
config:
stat_prefix: ingress_http
route_config:
virtual_hosts:
- name: local_service
domains: ['*']
routes:
- match:
prefix: "/"
route:
cluster: backend
http_filters:
- name: envoy.router
clusters:
- name: backend
type: STATIC
connect_timeout: 1s
hosts:
- socket_address:
address: 127.0.0.1
port_value: 10000
They perform identically:
$ # Load test nginx
$ hey -z 10s -q 1000 -c 1 -t 1 http://proxy.local:8000
Summary:
Total: 10.0006 secs
Slowest: 0.0229 secs
Fastest: 0.0002 secs
Average: 0.0004 secs
Requests/sec: 996.7418
Total data: 36881600 bytes
Size/request: 3700 bytes
Response time histogram:
0.000 [1] |
0.002 [9963] |■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■
0.005 [3] |
0.007 [0] |
0.009 [0] |
0.012 [0] |
0.014 [0] |
0.016 [0] |
0.018 [0] |
0.021 [0] |
0.023 [1] |
...
Status code distribution:
[200] 9968 responses
$ # Load test Envoy
$ hey -z 10s -q 1000 -c 1 -t 1 http://proxy.local:8001
Summary:
Total: 10.0006 secs
Slowest: 0.0307 secs
Fastest: 0.0003 secs
Average: 0.0007 secs
Requests/sec: 996.1445
Total data: 36859400 bytes
Size/request: 3700 bytes
Response time histogram:
0.000 [1] |
0.003 [9960] |■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■
0.006 [0] |
0.009 [0] |
0.012 [0] |
0.015 [0] |
0.019 [0] |
0.022 [0] |
0.025 [0] |
0.028 [0] |
0.031 [1] |
...
Status code distribution:
[200] 9962 responses
Anyway, let's check the crucial part – mirroring to a backend with a delay. A quick reminder – nginx, in that case, will throttle the original requests, thus affecting your production users.
Here is the mirroring config for Envoy:
# Envoy mirroring config
static_resources:
listeners:
- name: listener_0
address:
socket_address:
protocol: TCP
address: 0.0.0.0
port_value: 8001
filter_chains:
- filters:
- name: envoy.http_connection_manager
config:
stat_prefix: ingress_http
route_config:
virtual_hosts:
- name: local_service
domains: ['*']
routes:
- match:
prefix: "/"
route:
cluster: backend
request_mirror_policy:
cluster: mirror
http_filters:
- name: envoy.router
clusters:
- name: backend
type: STATIC
connect_timeout: 1s
hosts:
- socket_address:
address: 127.0.0.1
port_value: 10000
- name: mirror
type: STATIC
connect_timeout: 1s
hosts:
- socket_address:
address: 127.0.0.1
port_value: 20000
Basically, we’ve added request_mirror_policy
to the main route and defined the
cluster for mirroring. Let’s load test it!
$ hey -z 10s -q 1000 -c 1 -t 1 http://proxy.local:8001
Summary:
Total: 10.0012 secs
Slowest: 0.0046 secs
Fastest: 0.0003 secs
Average: 0.0008 secs
Requests/sec: 997.6801
Total data: 36918600 bytes
Size/request: 3700 bytes
Response time histogram:
0.000 [1] |
0.001 [2983] |■■■■■■■■■■■■■■■■■
0.001 [6916] |■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■
0.002 [72] |
0.002 [2] |
0.002 [0] |
0.003 [0] |
0.003 [3] |
0.004 [0] |
0.004 [0] |
0.005 [1] |
...
Status code distribution:
[200] 9978 responses
Zero errors and amazing latency! This is a victory and it proves that Envoy’s mirroring is truly “fire and forget”!
Envoy's networking is of exceptional quality – its mirroring is well thought out, its load balancing is very advanced and I like the active health check feature.
I’m not convinced to use it in the edge proxy scenario because you might need features of a web server like caching, content serving and advanced configuration.
As for the service mesh – I’ll surely evaluate Envoy for that when the opportunity arises, so stay tuned – subscribe to the Atom feed and check my twitter @AlexDzyoba.
That’s it for now, till the next time!
]]>I've used it for pre-production testing of the new rewritten system to see how well (if at all ;-) it can handle the production workload. There are some non-obvious problems and tips that I didn't find when I started this journey, so now I want to share them.
Let’s begin with a simple setup. Say, we have some backend that handles production workload and we put a proxy in front of it:
Here is the nginx config:
upstream backend {
server backend.local:10000;
}
server {
server_name proxy.local;
listen 8000;
location / {
proxy_pass http://backend;
}
}
There are 2 parts – backend and proxy. The proxy (nginx) is listening on port
8000 and just passing requests to the backend on port 10000. Nothing fancy, but
let’s do a quick load test to see how it performs. I’m using hey
tool because it’s simple and allows generating
constant load instead of bombarding as hard as possible like many other tools do
(wrk, apache benchmark, siege).
$ hey -z 10s -q 1000 -n 100000 -c 1 -t 1 http://proxy.local:8000
Summary:
Total: 10.0016 secs
Slowest: 0.0225 secs
Fastest: 0.0003 secs
Average: 0.0005 secs
Requests/sec: 995.8393
Total data: 6095520 bytes
Size/request: 612 bytes
Response time histogram:
0.000 [1] |
0.003 [9954] |■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■
0.005 [4] |
0.007 [0] |
0.009 [0] |
0.011 [0] |
0.014 [0] |
0.016 [0] |
0.018 [0] |
0.020 [0] |
0.022 [1] |
Latency distribution:
10% in 0.0003 secs
25% in 0.0004 secs
50% in 0.0005 secs
75% in 0.0006 secs
90% in 0.0007 secs
95% in 0.0007 secs
99% in 0.0009 secs
Details (average, fastest, slowest):
DNS+dialup: 0.0000 secs, 0.0003 secs, 0.0225 secs
DNS-lookup: 0.0000 secs, 0.0000 secs, 0.0008 secs
req write: 0.0000 secs, 0.0000 secs, 0.0003 secs
resp wait: 0.0004 secs, 0.0002 secs, 0.0198 secs
resp read: 0.0001 secs, 0.0000 secs, 0.0012 secs
Status code distribution:
[200] 9960 responses
Good, most of the requests are handled in less than a millisecond and there are no errors – that’s our baseline.
Now, let’s put another test backend and mirror traffic to it
The basic mirroring is configured like this:
upstream backend {
server backend.local:10000;
}
upstream test_backend {
server test.local:20000;
}
server {
server_name proxy.local;
listen 8000;
location / {
mirror /mirror;
proxy_pass http://backend;
}
location = /mirror {
internal;
proxy_pass http://test_backend$request_uri;
}
}
We add the mirror
directive to mirror requests to an internal location and define
that internal location. In that internal location we can do whatever nginx
allows us to do, but for now we simply proxy pass all requests.
Let’s load test it again to check how mirroring affects the performance:
$ hey -z 10s -q 1000 -n 100000 -c 1 -t 1 http://proxy.local:8000
Summary:
Total: 10.0010 secs
Slowest: 0.0042 secs
Fastest: 0.0003 secs
Average: 0.0005 secs
Requests/sec: 997.3967
Total data: 6104700 bytes
Size/request: 612 bytes
Response time histogram:
0.000 [1] |
0.001 [9132] |■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■
0.001 [792] |■■■
0.001 [43] |
0.002 [3] |
0.002 [0] |
0.003 [2] |
0.003 [0] |
0.003 [0] |
0.004 [1] |
0.004 [1] |
Latency distribution:
10% in 0.0003 secs
25% in 0.0004 secs
50% in 0.0005 secs
75% in 0.0006 secs
90% in 0.0007 secs
95% in 0.0008 secs
99% in 0.0010 secs
Details (average, fastest, slowest):
DNS+dialup: 0.0000 secs, 0.0003 secs, 0.0042 secs
DNS-lookup: 0.0000 secs, 0.0000 secs, 0.0009 secs
req write: 0.0000 secs, 0.0000 secs, 0.0002 secs
resp wait: 0.0004 secs, 0.0002 secs, 0.0041 secs
resp read: 0.0001 secs, 0.0000 secs, 0.0021 secs
Status code distribution:
[200] 9975 responses
It’s pretty much the same – millisecond latency and no errors. And that’s good because it proves that mirroring itself doesn’t affect original requests.
That’s all nice and dandy but what if mirror backend has some bugs and sometimes replies with errors? What would happen to the original requests?
To test this I’ve made a trivial Go service that can inject errors randomly. Let’s launch it
$ mirror-backend -errors
2019/01/13 14:43:12 Listening on port 20000, delay is 0, error injecting is true
and see what load testing will show:
$ hey -z 10s -q 1000 -n 100000 -c 1 -t 1 http://proxy.local:8000
Summary:
Total: 10.0008 secs
Slowest: 0.0027 secs
Fastest: 0.0003 secs
Average: 0.0005 secs
Requests/sec: 998.7205
Total data: 6112656 bytes
Size/request: 612 bytes
Response time histogram:
0.000 [1] |
0.001 [7388] |■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■
0.001 [2232] |■■■■■■■■■■■■
0.001 [324] |■■
0.001 [27] |
0.002 [6] |
0.002 [2] |
0.002 [3] |
0.002 [2] |
0.002 [0] |
0.003 [3] |
Latency distribution:
10% in 0.0003 secs
25% in 0.0003 secs
50% in 0.0004 secs
75% in 0.0006 secs
90% in 0.0007 secs
95% in 0.0008 secs
99% in 0.0009 secs
Details (average, fastest, slowest):
DNS+dialup: 0.0000 secs, 0.0003 secs, 0.0027 secs
DNS-lookup: 0.0000 secs, 0.0000 secs, 0.0008 secs
req write: 0.0000 secs, 0.0000 secs, 0.0001 secs
resp wait: 0.0004 secs, 0.0002 secs, 0.0026 secs
resp read: 0.0001 secs, 0.0000 secs, 0.0006 secs
Status code distribution:
[200] 9988 responses
Nothing changed at all! And that’s great because errors in the mirror backend don’t affect the main backend. nginx mirror module ignores responses to the mirror subrequests so this behavior is nice and intended.
But what if our mirror backend is not returning errors but is just plain slow? How will the original requests behave? Let's find out!
My mirror backend has an option to delay every request by a configured number of seconds. Here I'm launching it with a 1 second delay:
$ mirror-backend -delay 1
2019/01/13 14:50:39 Listening on port 20000, delay is 1, error injecting is false
So let's see what the load test shows:
$ hey -z 10s -q 1000 -n 100000 -c 1 -t 1 http://proxy.local:8000
Summary:
Total: 10.0290 secs
Slowest: 0.0023 secs
Fastest: 0.0018 secs
Average: 0.0021 secs
Requests/sec: 1.9942
Total data: 6120 bytes
Size/request: 612 bytes
Response time histogram:
0.002 [1] |■■■■■■■■■■
0.002 [0] |
0.002 [1] |■■■■■■■■■■
0.002 [0] |
0.002 [0] |
0.002 [0] |
0.002 [1] |■■■■■■■■■■
0.002 [1] |■■■■■■■■■■
0.002 [0] |
0.002 [4] |■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■
0.002 [2] |■■■■■■■■■■■■■■■■■■■■
Latency distribution:
10% in 0.0018 secs
25% in 0.0021 secs
50% in 0.0022 secs
75% in 0.0023 secs
90% in 0.0023 secs
0% in 0.0000 secs
0% in 0.0000 secs
Details (average, fastest, slowest):
DNS+dialup: 0.0007 secs, 0.0018 secs, 0.0023 secs
DNS-lookup: 0.0003 secs, 0.0002 secs, 0.0006 secs
req write: 0.0001 secs, 0.0001 secs, 0.0002 secs
resp wait: 0.0011 secs, 0.0007 secs, 0.0013 secs
resp read: 0.0002 secs, 0.0001 secs, 0.0002 secs
Status code distribution:
[200] 10 responses
Error distribution:
[10] Get http://proxy.local:8000: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
What? 1.9 rps? Where is my 1000 rps? We’ve got errors? What’s happening?
Let me explain how mirroring in nginx works.
When a request comes to nginx and mirroring is enabled, nginx will create a mirror subrequest and do what the mirror location specifies – in our case, it will send it to the mirror backend.
But the thing is that the subrequest is linked to the original request, so as far as I understand, until that mirror subrequest is finished the original request will be throttled.
That's why we get ~2 rps in the previous test – hey
sent 10 requests, got
responses, then sent the next 10 requests, but they stalled because the previous mirror
subrequests were delayed, and then the timeout kicked in and errored the last 10
requests.
If we increase the timeout in hey to, say, 10 seconds we will receive no errors and 1 rps:
$ hey -z 10s -q 1000 -n 100000 -c 1 -t 10 http://proxy.local:8000
Summary:
Total: 10.0197 secs
Slowest: 1.0018 secs
Fastest: 0.0020 secs
Average: 0.9105 secs
Requests/sec: 1.0978
Total data: 6732 bytes
Size/request: 612 bytes
Response time histogram:
0.002 [1] |■■■■
0.102 [0] |
0.202 [0] |
0.302 [0] |
0.402 [0] |
0.502 [0] |
0.602 [0] |
0.702 [0] |
0.802 [0] |
0.902 [0] |
1.002 [10] |■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■
Latency distribution:
10% in 1.0011 secs
25% in 1.0012 secs
50% in 1.0016 secs
75% in 1.0016 secs
90% in 1.0018 secs
0% in 0.0000 secs
0% in 0.0000 secs
Details (average, fastest, slowest):
DNS+dialup: 0.0001 secs, 0.0020 secs, 1.0018 secs
DNS-lookup: 0.0000 secs, 0.0000 secs, 0.0005 secs
req write: 0.0001 secs, 0.0000 secs, 0.0002 secs
resp wait: 0.9101 secs, 0.0008 secs, 1.0015 secs
resp read: 0.0002 secs, 0.0001 secs, 0.0003 secs
Status code distribution:
[200] 11 responses
So the point here is that if mirrored subrequests are slow then the original requests will be throttled. I don’t know how to fix this but I know the workaround – mirror only some part of the traffic. Let me show you how.
If you’re not sure that mirror backend can handle the original load you can mirror only some part of the traffic – for example, 10%.
The mirror
directive is not configurable and replicates all requests to the mirror location, so it's
not obvious how to do this. The key point in achieving this is the internal mirror location.
If you remember, I've said that you can do anything to mirrored requests in their
location. So here is how I did it:
1 upstream backend {
2 server backend.local:10000;
3 }
4
5 upstream test_backend {
6 server test.local:20000;
7 }
8
9 split_clients $remote_addr $mirror_backend {
10 50% test_backend;
11 * "";
12 }
13
14 server {
15 server_name proxy.local;
16 listen 8000;
17
18 access_log /var/log/nginx/proxy.log;
19 error_log /var/log/nginx/proxy.error.log info;
20
21 location / {
22 mirror /mirror;
23 proxy_pass http://backend;
24 }
25
26 location = /mirror {
27 internal;
28 if ($mirror_backend = "") {
29 return 400;
30 }
31
32 proxy_pass http://$mirror_backend$request_uri;
33 }
34
35 }
36
First of all, in the mirror location we proxy pass to the upstream that is taken
from the variable $mirror_backend
(line 32). This variable is set in the split_clients
block (lines 9-12) based on the client's remote address. What split_clients
does is
set the right-hand variable value based on the distribution of the left-hand variable. In our case, we
look at the request's remote address ($remote_addr
variable) and for 50% of remote addresses we set
$mirror_backend
to test_backend
, for other requests it's set to an empty
string. Finally, the partial mirroring itself happens in the mirror location – if the
$mirror_backend
variable is empty we reject that mirror subrequest, otherwise we
proxy_pass
it. Remember that failures in mirror subrequests don't affect the
original requests, so it's safe to drop the request with an error status.
The beauty of this solution is that you can split traffic for mirroring based on
any variable or combination of them. If you want to really differentiate your users then
the remote address may not be the best split key – a user may use many IPs or change
them. In that case, you're better off using some user-sticky key like an API key.
For mirroring 50% of traffic based on the apikey
query parameter we just change
the key in split_clients
:
split_clients $arg_apikey $mirror_backend {
50% test_backend;
* "";
}
When we query apikeys from 1 to 20, only about half of them (11) will be mirrored. Here is the curl:
$ for i in {1..20};do curl -i "proxy.local:8000/?apikey=${i}" ;done
and here is the log of mirror backend:
...
2019/01/13 22:34:34 addr=127.0.0.1:47224 host=test_backend uri="/?apikey=1"
2019/01/13 22:34:34 addr=127.0.0.1:47230 host=test_backend uri="/?apikey=2"
2019/01/13 22:34:34 addr=127.0.0.1:47240 host=test_backend uri="/?apikey=4"
2019/01/13 22:34:34 addr=127.0.0.1:47246 host=test_backend uri="/?apikey=5"
2019/01/13 22:34:34 addr=127.0.0.1:47252 host=test_backend uri="/?apikey=6"
2019/01/13 22:34:34 addr=127.0.0.1:47262 host=test_backend uri="/?apikey=8"
2019/01/13 22:34:34 addr=127.0.0.1:47272 host=test_backend uri="/?apikey=10"
2019/01/13 22:34:34 addr=127.0.0.1:47278 host=test_backend uri="/?apikey=11"
2019/01/13 22:34:34 addr=127.0.0.1:47288 host=test_backend uri="/?apikey=13"
2019/01/13 22:34:34 addr=127.0.0.1:47298 host=test_backend uri="/?apikey=15"
2019/01/13 22:34:34 addr=127.0.0.1:47308 host=test_backend uri="/?apikey=17"
...
And the most awesome thing is that partitioning in split_clients
is consistent –
requests with apikey=1
will always be mirrored.
So this was my experience with the nginx mirror module so far. I've shown you how to
simply mirror all of the traffic and how to mirror part of the traffic with the
help of the split_clients
module. I've also covered error handling and a non-obvious
problem where normal requests are throttled in case of a slow mirror backend.
Hope you’ve enjoyed it! Subscribe to the Atom feed. I also post on twitter @AlexDzyoba.
That’s it for now, till the next time!
]]>tzconv
–
https://github.com/alexdzyoba/tzconv. It’s a CLI tool that converts time between
timezones and it's useful (at least for me) when you investigate an incident
and need to match times.
Imagine, you had an incident that happened at 11:45 your local time but your logs in ELK or Splunk are in UTC. So, what time was 11:45 in UTC?
$ tzconv utc 11:45
08:45
Boom! You got it!
You can add a third parameter to convert time from a specific timezone rather than from your local one. For instance, your alert system sent you an email with a central European time and your server log timestamps are in Eastern time.
$ tzconv neyork 20:20 cet
14:20
Note, that I’ve mistyped New York and it still worked. That’s because locations are not matched exactly but fuzzy searched!
You can find more examples in the project README. Feel free to contribute, I’ve got a couple of things I would like to see implemented – check the issues page. The tool itself is written in Go and quite simple yet useful.
That’s it for now, till the next time!
]]>Times were different back then and now we can have a really beefy server with a 10G network, 32 cores and 256 GiB RAM that can easily handle that amount of clients, so c10k is not much of a problem even with threaded I/O. But, anyway, I wanted to check how various solutions like threads and non-blocking async I/O handle it, so I started to write some silly servers in my c10k repo and then I got stuck because I needed some tools to test my implementations.
Basically, I needed a c10k client. And I actually wrote a couple – one in Go and the other in C with libuv. I’m going to also write the one in Python 3 with asyncio.
While I was writing each client I found 2 peculiarities – how to make it bad and how to make it slow.
By making it bad I mean making it really c10k – creating a lot of connections to the server, thus saturating its resources.
I started with the client in Go and quickly stumbled upon the first roadblock. When I was making
10 concurrent HTTP requests with simple "net/http"
calls there were only 2 TCP connections
$ lsof -p $(pgrep go-client) -n -P
COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
go-client 11959 avd cwd DIR 253,0 4096 1183846 /home/avd/go/src/github.com/dzeban/c10k
go-client 11959 avd rtd DIR 253,0 4096 2 /
go-client 11959 avd txt REG 253,0 6240125 1186984 /home/avd/go/src/github.com/dzeban/c10k/go-client
go-client 11959 avd mem REG 253,0 2066456 3151328 /usr/lib64/libc-2.26.so
go-client 11959 avd mem REG 253,0 149360 3152802 /usr/lib64/libpthread-2.26.so
go-client 11959 avd mem REG 253,0 178464 3151302 /usr/lib64/ld-2.26.so
go-client 11959 avd 0u CHR 136,0 0t0 3 /dev/pts/0
go-client 11959 avd 1u CHR 136,0 0t0 3 /dev/pts/0
go-client 11959 avd 2u CHR 136,0 0t0 3 /dev/pts/0
go-client 11959 avd 4u a_inode 0,13 0 12735 [eventpoll]
go-client 11959 avd 8u IPv4 68232 0t0 TCP 127.0.0.1:55224->127.0.0.1:80 (ESTABLISHED)
go-client 11959 avd 10u IPv4 68235 0t0 TCP 127.0.0.1:55230->127.0.0.1:80 (ESTABLISHED)
The same with ss
1
$ ss -tnp dst 127.0.0.1:80
State Recv-Q Send-Q Local Address:Port Peer Address:Port
ESTAB 0 0 127.0.0.1:55224 127.0.0.1:80 users:(("go-client",pid=11959,fd=8))
ESTAB 0 0 127.0.0.1:55230 127.0.0.1:80 users:(("go-client",pid=11959,fd=10))
The reason for this is quite simple – HTTP 1.1 is using persistent connections
with TCP keepalive for clients to avoid the overhead of TCP handshake on each
HTTP request. Go’s "net/http"
fully implements this logic – it multiplexes
multiple requests over a handful of TCP connections. It can be tuned via
Transport
.
But I don't need to tune it, I need to avoid it. And we can avoid it by
explicitly creating a TCP connection via net.Dial
and then sending a single
request over this connection. Here is the function that does it and runs
concurrently inside a dedicated goroutine.
func request(addr string, delay int, wg *sync.WaitGroup) {
conn, err := net.Dial("tcp", addr)
if err != nil {
log.Fatal("dial error ", err)
}
req, err := http.NewRequest("GET", "/index.html", nil)
if err != nil {
log.Fatal("failed to create http request")
}
req.Host = "localhost"
err = req.Write(conn)
if err != nil {
log.Fatal("failed to send http request")
}
_, err = bufio.NewReader(conn).ReadString('\n')
if err != nil {
log.Fatal("read error ", err)
}
wg.Done()
}
Let’s check it’s working
$ lsof -p $(pgrep go-client) -n -P
COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
go-client 12231 avd cwd DIR 253,0 4096 1183846 /home/avd/go/src/github.com/dzeban/c10k
go-client 12231 avd rtd DIR 253,0 4096 2 /
go-client 12231 avd txt REG 253,0 6167884 1186984 /home/avd/go/src/github.com/dzeban/c10k/go-client
go-client 12231 avd mem REG 253,0 2066456 3151328 /usr/lib64/libc-2.26.so
go-client 12231 avd mem REG 253,0 149360 3152802 /usr/lib64/libpthread-2.26.so
go-client 12231 avd mem REG 253,0 178464 3151302 /usr/lib64/ld-2.26.so
go-client 12231 avd 0u CHR 136,0 0t0 3 /dev/pts/0
go-client 12231 avd 1u CHR 136,0 0t0 3 /dev/pts/0
go-client 12231 avd 2u CHR 136,0 0t0 3 /dev/pts/0
go-client 12231 avd 3u IPv4 71768 0t0 TCP 127.0.0.1:55256->127.0.0.1:80 (ESTABLISHED)
go-client 12231 avd 4u a_inode 0,13 0 12735 [eventpoll]
go-client 12231 avd 5u IPv4 73753 0t0 TCP 127.0.0.1:55258->127.0.0.1:80 (ESTABLISHED)
go-client 12231 avd 6u IPv4 71769 0t0 TCP 127.0.0.1:55266->127.0.0.1:80 (ESTABLISHED)
go-client 12231 avd 7u IPv4 71770 0t0 TCP 127.0.0.1:55264->127.0.0.1:80 (ESTABLISHED)
go-client 12231 avd 8u IPv4 73754 0t0 TCP 127.0.0.1:55260->127.0.0.1:80 (ESTABLISHED)
go-client 12231 avd 9u IPv4 71771 0t0 TCP 127.0.0.1:55262->127.0.0.1:80 (ESTABLISHED)
go-client 12231 avd 10u IPv4 71774 0t0 TCP 127.0.0.1:55268->127.0.0.1:80 (ESTABLISHED)
go-client 12231 avd 11u IPv4 73755 0t0 TCP 127.0.0.1:55270->127.0.0.1:80 (ESTABLISHED)
go-client 12231 avd 12u IPv4 71775 0t0 TCP 127.0.0.1:55272->127.0.0.1:80 (ESTABLISHED)
go-client 12231 avd 13u IPv4 73758 0t0 TCP 127.0.0.1:55274->127.0.0.1:80 (ESTABLISHED)
$ ss -tnp dst 127.0.0.1:80
State Recv-Q Send-Q Local Address:Port Peer Address:Port
ESTAB 0 0 127.0.0.1:55260 127.0.0.1:80 users:(("go-client",pid=12231,fd=8))
ESTAB 0 0 127.0.0.1:55262 127.0.0.1:80 users:(("go-client",pid=12231,fd=9))
ESTAB 0 0 127.0.0.1:55270 127.0.0.1:80 users:(("go-client",pid=12231,fd=11))
ESTAB 0 0 127.0.0.1:55266 127.0.0.1:80 users:(("go-client",pid=12231,fd=6))
ESTAB 0 0 127.0.0.1:55256 127.0.0.1:80 users:(("go-client",pid=12231,fd=3))
ESTAB 0 0 127.0.0.1:55272 127.0.0.1:80 users:(("go-client",pid=12231,fd=12))
ESTAB 0 0 127.0.0.1:55258 127.0.0.1:80 users:(("go-client",pid=12231,fd=5))
ESTAB 0 0 127.0.0.1:55268 127.0.0.1:80 users:(("go-client",pid=12231,fd=10))
ESTAB 0 0 127.0.0.1:55264 127.0.0.1:80 users:(("go-client",pid=12231,fd=7))
ESTAB 0 0 127.0.0.1:55274 127.0.0.1:80 users:(("go-client",pid=12231,fd=13))
I also decided to make a C client built on top of libuv for convenient event loop.
In my C client, there is no HTTP library so we're making TCP connections from the start. It works well by creating a connection for each request, so it doesn't have the problem (more like a feature :-) of the Go client. But when it finishes reading the response it gets stuck and doesn't return control to the event loop until a very long timeout.
Here is the response reading callback that seems stuck:
static void on_read(uv_stream_t* stream, ssize_t nread, const uv_buf_t* buf)
{
if (nread > 0) {
printf("%s", buf->base);
} else if (nread == UV_EOF) {
log("close stream");
uv_connect_t *conn = uv_handle_get_data((uv_handle_t *)stream);
uv_close((uv_handle_t *)stream, free_close_cb);
free(conn);
} else {
return_uv_err(nread);
}
free(buf->base);
}
It appears that we're stuck here and wait for some (quite long) time until we finally get EOF.
This “quite long time” is actually HTTP keepalive timeout set in nginx and by default it’s 75 seconds.
We can control it on the client though with
Connection
and
Keep-Alive
HTTP headers which are part of HTTP 1.1.
And that's the only sane solution because on the libuv side I had no way to close the connection – I don't receive EOF because it is sent only when the connection is actually closed.
So what is happening is that my client creates a connection and sends a request, nginx replies and then keeps the connection open because it waits for subsequent requests. Tinkering with libuv showed me that, and that's why I love making things in C – you have to dig really deep and really understand how things work.
So to solve these hanging requests I've just set the Connection: close
header to
enforce a new connection for each request from the same client and to disable
HTTP keepalive. As an alternative, I could just insist on HTTP 1.0 where there is
no keep-alive.
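So the raw request my client writes ends up looking roughly like this (a sketch – the path and host match the earlier Go example):
GET /index.html HTTP/1.1
Host: localhost
Connection: close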
Now that it's creating lots of connections, let's make it keep those connections open for a client-specified delay to appear as a slow client.
I needed to make it slow because I wanted my server to spend some time handling the requests while avoiding putting sleeps in the server code.
Initially, I thought to make reading on the client side slow, i.e. reading one byte at a time or delaying reading the server response. Interestingly, none of these solutions worked.
I tested my client with nginx by watching access log with the
$request_time
variable. Needless to say, all of my requests were served in 0.000 seconds.
Whatever delay I’ve inserted, nginx seemed to ignore it.
I started to figure out why by tweaking various parts of the request-response pipeline like the number of connections, response size, etc.
Finally, I was able to see my delay only when nginx was serving really big file like 30 MB and that’s when it clicked.
The whole reason for this delay-ignoring behavior was socket buffers. Socket buffers are, well, buffers for sockets; in other words, it's the piece of memory where the Linux kernel buffers network requests and responses for performance reasons – to send data in big chunks over the network and to mitigate slow clients, and also for other things like TCP retransmission. Socket buffers are like the page cache – all network I/O (with the page cache it's disk I/O) goes through them unless explicitly skipped.
So in my case, when nginx received a request, the response written by the send/write syscall was merely stored in the socket buffer, but from nginx's point of view it was done. Only when the response was large enough to not fit in the socket buffer would nginx be blocked in the syscall and wait until the client delay had elapsed and the socket buffer was read and freed for the next portion of data.
You can check and tune the size of the socket buffers in
/proc/sys/net/ipv4/tcp_rmem
and /proc/sys/net/ipv4/tcp_wmem
.
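For example (the three values are min, default and max in bytes and will differ between systems):
$ cat /proc/sys/net/ipv4/tcp_wmem
4096	16384	4194304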
So after figuring this out, I’ve inserted delay after establishing the connection and before sending a request.
This way the server will keep around client connections (yay, c10k!) for a client-specified delay.
So in the end, I have 2 c10k clients – one written in Go and the other written in C with libuv. The Python 3 client is on its way.
All of these clients connect to the HTTP server, wait for a specified delay
and then send a GET request with Connection: close
header.
This makes HTTP server keep a dedicated connection for each request and spend some time waiting to emulate I/O.
That’s how my c10k clients work.
ss
stands for socket stats and it’s more versatile tool to inspect sockets than netstat
. ↩︎
In this post, I’ll share the JMX part because I don’t feel that I’ve fully understood the data model and PromQL. So let’s dive into that jmx-exporter thing.
jmx-exporter is a program that reads JMX data from JVM based applications (e.g. Java and Scala) and exposes it via HTTP in a simple text format that Prometheus understand and can scrape.
JMX is a common technology in Java world for exporting statistics of running application and also to control it (you can trigger GC with JMX, for example).
jmx-exporter is a Java application that uses JMX APIs to collect app and JVM metrics. It is a Java agent, which means it runs inside the same JVM. This gives you a nice benefit of not exposing JMX remotely – jmx-exporter will just collect the metrics and expose them over HTTP in read-only mode.
Because it’s written in Java, jmx-exporter is distributed as a jar, so you just need to download it from maven and put it somewhere on your target host.
I have an Ansible role for this – https://github.com/alexdzyoba/ansible-jmx-exporter. Besides downloading the jar it’ll also put the configuration file for jmx-exporter.
This configuration file contains rules for rewriting JMX MBeans to the Prometheus exposition format metrics. Basically, it’s a collection of regexps to convert MBeans strings to Prometheus strings.
The example_configs directory in jmx-exporter sources contains examples for many popular Java apps including Kafka and Zookeeper.
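To give you a feel for the format, a rule looks roughly like this (a trimmed sketch – the pattern and resulting metric name are illustrative, take real rules from example_configs):
lowercaseOutputName: true
rules:
  - pattern: 'kafka.server<type=(.+), name=(.+)><>Value'
    name: kafka_server_$1_$2
    type: GAUGE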
As I've said, jmx-exporter runs inside another JVM as a Java agent to collect JMX metrics. To demonstrate how it all works, let's run it within Zookeeper.
Zookeeper is a crucial part of many production systems including Hadoop, Kafka
and Clickhouse, so you really want to monitor it. Despite the fact that you can
do this with 4lw commands
(mntr
, stat
, etc.) and that there are
dedicated exporters
I prefer to use JMX to avoid constant Zookeeper querying (they add noise to
metrics because 4lw commands counted as normal Zookeeper requests).
To scrape Zookeeper JMX metrics with jmx-exporter you have to pass the following arguments to Zookeeper launch:
-javaagent:/opt/jmx-exporter/jmx-exporter.jar=7070:/etc/jmx-exporter/zookeeper.yml
If you use the Zookeeper that is distributed with Kafka (you shouldn’t) then
pass it via EXTRA_ARGS
:
$ export EXTRA_ARGS="-javaagent:/opt/jmx-exporter/jmx-exporter.jar=7070:/etc/jmx-exporter/zookeeper.yml"
$ /opt/kafka_2.11-0.10.1.0/bin/zookeeper-server-start.sh /opt/kafka_2.11-0.10.1.0/config/zookeeper.properties
If you use standalone Zookeeper distribution then add it as SERVER_JVMFLAGS to the zookeeper-env.sh:
# zookeeper-env.sh
SERVER_JVMFLAGS="-javaagent:/opt/jmx-exporter/jmx-exporter.jar=7070:/etc/jmx-exporter/zookeeper.yml"
Anyway, when you launch Zookeeper you should see the process listening on the
specified port (7070 in my case) and responding to /metrics
queries:
$ netstat -tlnp | grep 7070
tcp 0 0 0.0.0.0:7070 0.0.0.0:* LISTEN 892/java
$ curl -s localhost:7070/metrics | head
# HELP jvm_threads_current Current thread count of a JVM
# TYPE jvm_threads_current gauge
jvm_threads_current 16.0
# HELP jvm_threads_daemon Daemon thread count of a JVM
# TYPE jvm_threads_daemon gauge
jvm_threads_daemon 12.0
# HELP jvm_threads_peak Peak thread count of a JVM
# TYPE jvm_threads_peak gauge
jvm_threads_peak 16.0
# HELP jvm_threads_started_total Started thread count of a JVM
Kafka is a message broker written in Scala, so it runs in the JVM, which in turn means that we can use jmx-exporter for its metrics.
To run jmx-exporter within Kafka, you should set KAFKA_OPTS
environment
variable like this:
$ export KAFKA_OPTS='-javaagent:/opt/jmx-exporter/jmx-exporter.jar=7071:/etc/jmx-exporter/kafka.yml'
Then launch Kafka (I assume that Zookeeper is already launched as it’s required by Kafka):
$ /opt/kafka_2.11-0.10.1.0/bin/kafka-server-start.sh /opt/kafka_2.11-0.10.1.0/config/server.properties
Check that jmx-exporter HTTP server is listening:
$ netstat -tlnp | grep 7071
tcp6 0 0 :::7071 :::* LISTEN 19288/java
And scrape the metrics!
$ curl -s localhost:7071 | grep -i kafka | head
# HELP kafka_server_replicafetchermanager_minfetchrate Attribute exposed for management (kafka.server<type=ReplicaFetcherManager, name=MinFetchRate, clientId=Replica><>Value)
# TYPE kafka_server_replicafetchermanager_minfetchrate untyped
kafka_server_replicafetchermanager_minfetchrate{clientId="Replica",} 0.0
# HELP kafka_network_requestmetrics_totaltimems Attribute exposed for management (kafka.network<type=RequestMetrics, name=TotalTimeMs, request=OffsetFetch><>Count)
# TYPE kafka_network_requestmetrics_totaltimems untyped
kafka_network_requestmetrics_totaltimems{request="OffsetFetch",} 0.0
kafka_network_requestmetrics_totaltimems{request="JoinGroup",} 0.0
kafka_network_requestmetrics_totaltimems{request="DescribeGroups",} 0.0
kafka_network_requestmetrics_totaltimems{request="LeaveGroup",} 0.0
kafka_network_requestmetrics_totaltimems{request="GroupCoordinator",} 0.0
Here is how to run jmx-exporter java agent if you are running Kafka under systemd:
...
[Service]
Restart=on-failure
Environment=KAFKA_OPTS=-javaagent:/opt/jmx-exporter/jmx-exporter.jar=7071:/etc/jmx-exporter/kafka.yml
ExecStart=/opt/kafka/bin/kafka-server-start.sh /etc/kafka/server.properties
ExecStop=/opt/kafka/bin/kafka-server-stop.sh
TimeoutStopSec=600
User=kafka
...
With jmx-exporter you can scrape the metrics of running JVM applications. jmx-exporter runs as a Java agent (inside the target JVM), scrapes JMX metrics, rewrites them according to the config rules and exposes them in the Prometheus exposition format.
For a quick setup check my Ansible role for jmx-exporter alexdzyoba.jmx-exporter.
That’s all for now, stay tuned by subscribing to the RSS or follow me on Twitter @AlexDzyoba.
This post will cover tricky cases with a cross-replicated cluster only, because that’s what I use. If you have a plain flat topology with single Redis instances on dedicated nodes you’ll be fine. But it’s not my case.
So let’s dive in.
First, let’s define some terms so we understand each other.
Second, let me describe how my Redis cluster topology looks and what cross-replication is.
Redis cluster is built from multiple Redis instances that are run in a cluster mode. Each instance is isolated because it serves a particular subset of keys in a master or slave role. The emphasis on the role is intentional – there is separate Redis instance for every shard master and every shard replica, e.g. if you have 3 shards with replication factor 3 (2 additional replicas) you have to run 9 Redis instances. This was my first naive attempt to create a cluster on 3 nodes:
$ redis-trib create --replicas 2 10.135.78.153:7000 10.135.78.196:7000 10.135.64.55:7000
>>> Creating cluster
*** ERROR: Invalid configuration for cluster creation.
*** Redis Cluster requires at least 3 master nodes.
*** This is not possible with 3 nodes and 2 replicas per node.
*** At least 9 nodes are required.
(redis-trib
is an “official” tool to create a Redis cluster)
The important point here is that all of the Redis tools operate with Redis instances, not nodes, so it’s your responsibility to put the instances in the right redundant topology.
Redis cluster requires at least 3 nodes because to survive a network partition it needs a majority of masters (like in Sentinel). If you want 1 replica then add another 3 nodes and boom! Now you have a 6-node cluster to operate.
It’s fine if you work in the cloud where you can just spin up a dozen small nodes that cost you little. Unfortunately, not everyone has joined the cloud party; some of us have to operate real metal nodes, and server hardware usually starts with something like 32 GiB of RAM and an 8-core CPU, which is real overkill for a Redis node.
So to save on hardware we can use a trick and run several instances on a single node (and probably colocate them with other services). But remember that in that case you have to distribute masters among nodes manually and configure cross-replication.
Cross replication simply means that you don’t have dedicated nodes for replicas, you just replicate the data to the next node.
This way you save on the cluster size – you can make a Redis cluster with 2 replicas on 3 nodes instead of 9. You have fewer things to operate and the nodes are better utilized – instead of one single-threaded lightweight Redis process per node on 9 nodes, you’ll have 3 such processes on each of 3 nodes.
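To make the target layout concrete, here is roughly how masters and replicas end up distributed across the 3 nodes (the shard letters are illustrative):

node1: master A, replica B, replica C
node2: replica A, master B, replica C
node3: replica A, replica B, master C

Each node runs one master and one replica of each of the other two shards, so losing any single node still leaves every slot range covered.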
To create a cluster you have to run redis-server
with the cluster-enabled yes
parameter. With a cross-replicated cluster you run multiple Redis
instances on a node, so you have to run them on separate ports. You can check
these two
manuals for details,
but the essential part is the configs. This is the config file I’m using:
protected-mode no
port {{ redis_port }}
daemonize no
loglevel notice
logfile ""
cluster-enabled yes
cluster-config-file nodes-{{ redis_port }}.conf
cluster-node-timeout 5000
cluster-require-full-coverage no
cluster-slave-validity-factor 0
The redis_port
variable takes the values 7000, 7001 and 7002, one for each shard. Launch
3 Redis server instances on ports 7000, 7001 and 7002 on each of the 3 nodes so
you’ll have 9 instances total, and let’s continue.
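As an illustration, launching the three instances on one node could look something like this (the config file paths are hypothetical; a process manager such as systemd would do the same job in production):

for port in 7000 7001 7002; do
  redis-server /etc/redis/cluster-${port}.conf &
done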
The first surprise may hit you when you build the cluster. If you invoke the
redis-trib
like this
$ redis-trib create --replicas 2 10.135.78.153:7000 10.135.78.196:7000 10.135.64.55:7000 10.135.78.153:7001 10.135.78.196:7001 10.135.64.55:7001 10.135.78.153:7002 10.135.78.196:7002 10.135.64.55:7002
then it may put all your master instances on a single node. This is happening because, again, it assumes that each instance lives on a separate node.
So you have to distribute masters and slaves by hand. To do so, first, create a cluster from masters and then add slaves for each master.
# Create a cluster with masters
$ redis-trib create 10.135.78.153:7000 10.135.78.196:7001 10.135.64.55:7002
>>> Creating cluster
>>> Performing hash slots allocation on 3 nodes...
Using 3 masters:
10.135.78.153:7000
10.135.78.196:7001
10.135.64.55:7002
M: 763646767dd5492366c3c9f2978faa022833b7af 10.135.78.153:7000
slots:0-5460 (5461 slots) master
M: f63c210b13d68fa5dc97ca078af6d9c167f8c6ec 10.135.78.196:7001
slots:5461-10922 (5462 slots) master
M: 5f4bb09230ca016e7ffe2e6a4e5a32470175fb66 10.135.64.55:7002
slots:10923-16383 (5461 slots) master
Can I set the above configuration? (type 'yes' to accept): yes
>>> Nodes configuration updated
>>> Assign a different config epoch to each node
>>> Sending CLUSTER MEET messages to join the cluster
Waiting for the cluster to join.
>>> Performing Cluster Check (using node 10.135.78.153:7000)
M: 763646767dd5492366c3c9f2978faa022833b7af 10.135.78.153:7000
slots:0-5460 (5461 slots) master
0 additional replica(s)
M: 5f4bb09230ca016e7ffe2e6a4e5a32470175fb66 10.135.64.55:7002
slots:10923-16383 (5461 slots) master
0 additional replica(s)
M: f63c210b13d68fa5dc97ca078af6d9c167f8c6ec 10.135.78.196:7001
slots:5461-10922 (5462 slots) master
0 additional replica(s)
[OK] All nodes agree about slots configuration.
>>> Check for open slots...
>>> Check slots coverage...
[OK] All 16384 slots covered.
This is our cluster now:
127.0.0.1:7000> CLUSTER NODES
763646767dd5492366c3c9f2978faa022833b7af 10.135.78.153:7000@17000 myself,master - 0 1524041299000 1 connected 0-5460
f63c210b13d68fa5dc97ca078af6d9c167f8c6ec 10.135.78.196:7001@17001 master - 0 1524041299426 2 connected 5461-10922
5f4bb09230ca016e7ffe2e6a4e5a32470175fb66 10.135.64.55:7002@17002 master - 0 1524041298408 3 connected 10923-16383
Now add 2 replicas for each master:
$ redis-trib add-node --slave --master-id 763646767dd5492366c3c9f2978faa022833b7af 10.135.78.196:7000 10.135.78.153:7000
$ redis-trib add-node --slave --master-id 763646767dd5492366c3c9f2978faa022833b7af 10.135.64.55:7000 10.135.78.153:7000
$ redis-trib add-node --slave --master-id f63c210b13d68fa5dc97ca078af6d9c167f8c6ec 10.135.78.153:7001 10.135.78.153:7000
$ redis-trib add-node --slave --master-id f63c210b13d68fa5dc97ca078af6d9c167f8c6ec 10.135.64.55:7001 10.135.78.153:7000
$ redis-trib add-node --slave --master-id 5f4bb09230ca016e7ffe2e6a4e5a32470175fb66 10.135.78.153:7002 10.135.78.153:7000
$ redis-trib add-node --slave --master-id 5f4bb09230ca016e7ffe2e6a4e5a32470175fb66 10.135.78.196:7002 10.135.78.153:7000
Now, this is our brand new cross-replicated cluster with 2 replicas:
$ redis-cli -c -p 7000 cluster nodes
763646767dd5492366c3c9f2978faa022833b7af 10.135.78.153:7000@17000 myself,master - 0 1524041947000 1 connected 0-5460
216a5ea51af1faed7fa42b0c153c91855f769321 10.135.78.196:7000@17000 slave 763646767dd5492366c3c9f2978faa022833b7af 0 1524041948515 1 connected
0441f7534aed16123bb3476124506251dab80747 10.135.64.55:7000@17000 slave 763646767dd5492366c3c9f2978faa022833b7af 0 1524041947094 1 connected
f63c210b13d68fa5dc97ca078af6d9c167f8c6ec 10.135.78.196:7001@17001 master - 0 1524043602115 2 connected 5461-10922
f90c932d5cf435c75697dc984b0cbb94c130f115 10.135.78.153:7001@17001 slave f63c210b13d68fa5dc97ca078af6d9c167f8c6ec 0 1524043601595 2 connected
00eb2402fc1868763a393ae2c9843c47cd7d49da 10.135.64.55:7001@17001 slave f63c210b13d68fa5dc97ca078af6d9c167f8c6ec 0 1524043600057 2 connected
5f4bb09230ca016e7ffe2e6a4e5a32470175fb66 10.135.64.55:7002@17002 master - 0 1524041948515 3 connected 10923-16383
af75fc17e552279e5939bfe2df68075b3b6f9b29 10.135.78.153:7002@17002 slave 5f4bb09230ca016e7ffe2e6a4e5a32470175fb66 0 1524041948000 3 connected
19b8c9f7ac472ecfedd109e6bb7a4b932905c4fd 10.135.78.196:7002@17002 slave 5f4bb09230ca016e7ffe2e6a4e5a32470175fb66 0 1524041947094 3 connected
If we fail our third node (10.135.64.55) with the DEBUG SEGFAULT
command, the cluster will continue to work:
127.0.0.1:7000> CLUSTER NODES
763646767dd5492366c3c9f2978faa022833b7af 10.135.78.153:7000@17000 myself,master - 0 1524043923000 1 connected 0-5460
216a5ea51af1faed7fa42b0c153c91855f769321 10.135.78.196:7000@17000 slave 763646767dd5492366c3c9f2978faa022833b7af 0 1524043924569 1 connected
0441f7534aed16123bb3476124506251dab80747 10.135.64.55:7000@17000 slave,fail 763646767dd5492366c3c9f2978faa022833b7af 1524043857000 1524043856593 1 disconnected
f63c210b13d68fa5dc97ca078af6d9c167f8c6ec 10.135.78.196:7001@17001 master - 0 1524043924874 2 connected 5461-10922
f90c932d5cf435c75697dc984b0cbb94c130f115 10.135.78.153:7001@17001 slave f63c210b13d68fa5dc97ca078af6d9c167f8c6ec 0 1524043924000 2 connected
00eb2402fc1868763a393ae2c9843c47cd7d49da 10.135.64.55:7001@17001 slave,fail f63c210b13d68fa5dc97ca078af6d9c167f8c6ec 1524043862669 1524043862000 2 disconnected
5f4bb09230ca016e7ffe2e6a4e5a32470175fb66 10.135.64.55:7002@17002 master,fail - 1524043864490 1524043862567 3 disconnected
af75fc17e552279e5939bfe2df68075b3b6f9b29 10.135.78.153:7002@17002 slave 19b8c9f7ac472ecfedd109e6bb7a4b932905c4fd 0 1524043924568 4 connected
19b8c9f7ac472ecfedd109e6bb7a4b932905c4fd 10.135.78.196:7002@17002 master - 0 1524043924000 4 connected 10923-16383
We can see that the replica on 10.135.78.196:7002 took over the slot range 10923-16383 and is now the master:
127.0.0.1:7000> set a 2
-> Redirected to slot [15495] located at 10.135.78.196:7002
OK
If we restore the Redis instances on the third node, the cluster will recover:
127.0.0.1:7000> CLUSTER nodes
763646767dd5492366c3c9f2978faa022833b7af 10.135.78.153:7000@17000 myself,master - 0 1524044130000 1 connected 0-5460
216a5ea51af1faed7fa42b0c153c91855f769321 10.135.78.196:7000@17000 slave 763646767dd5492366c3c9f2978faa022833b7af 0 1524044131572 1 connected
0441f7534aed16123bb3476124506251dab80747 10.135.64.55:7000@17000 slave 763646767dd5492366c3c9f2978faa022833b7af 0 1524044131367 1 connected
f63c210b13d68fa5dc97ca078af6d9c167f8c6ec 10.135.78.196:7001@17001 master - 0 1524044130334 2 connected 5461-10922
f90c932d5cf435c75697dc984b0cbb94c130f115 10.135.78.153:7001@17001 slave f63c210b13d68fa5dc97ca078af6d9c167f8c6ec 0 1524044131876 2 connected
00eb2402fc1868763a393ae2c9843c47cd7d49da 10.135.64.55:7001@17001 slave f63c210b13d68fa5dc97ca078af6d9c167f8c6ec 0 1524044131877 2 connected
19b8c9f7ac472ecfedd109e6bb7a4b932905c4fd 10.135.78.196:7002@17002 master - 0 1524044131572 4 connected 10923-16383
af75fc17e552279e5939bfe2df68075b3b6f9b29 10.135.78.153:7002@17002 slave 19b8c9f7ac472ecfedd109e6bb7a4b932905c4fd 0 1524044131000 4 connected
5f4bb09230ca016e7ffe2e6a4e5a32470175fb66 10.135.64.55:7002@17002 slave 19b8c9f7ac472ecfedd109e6bb7a4b932905c4fd 0 1524044131572 4 connected
However, the master was not moved back to the original node – it’s still on the second node (10.135.78.196). After the reboot the third node contains only slave instances
$ redis-cli -c -p 7000 cluster nodes | grep 10.135.64.55
0441f7534aed16123bb3476124506251dab80747 10.135.64.55:7000@17000 slave 763646767dd5492366c3c9f2978faa022833b7af 0 1524044294347 1 connected
00eb2402fc1868763a393ae2c9843c47cd7d49da 10.135.64.55:7001@17001 slave f63c210b13d68fa5dc97ca078af6d9c167f8c6ec 0 1524044293138 2 connected
5f4bb09230ca016e7ffe2e6a4e5a32470175fb66 10.135.64.55:7002@17002 slave 19b8c9f7ac472ecfedd109e6bb7a4b932905c4fd 0 1524044294553 4 connected
and the second node serves 2 master instances.
$ redis-cli -c -p 7000 cluster nodes | grep 10.135.78.196
216a5ea51af1faed7fa42b0c153c91855f769321 10.135.78.196:7000@17000 slave 763646767dd5492366c3c9f2978faa022833b7af 0 1524044345000 1 connected
f63c210b13d68fa5dc97ca078af6d9c167f8c6ec 10.135.78.196:7001@17001 master - 0 1524044345000 2 connected 5461-10922
19b8c9f7ac472ecfedd109e6bb7a4b932905c4fd 10.135.78.196:7002@17002 master - 0 1524044345000 4 connected 10923-16383
Now, what is interesting is that if the second node fails in this state, we’ll lose 2 out of 3 masters and we’ll lose the whole cluster because there is no master majority.
$ redis-cli -c -p 7000 cluster nodes
763646767dd5492366c3c9f2978faa022833b7af 10.135.78.153:7000@17000 myself,master - 0 1524046655000 1 connected 0-5460
216a5ea51af1faed7fa42b0c153c91855f769321 10.135.78.196:7000@17000 slave,fail 763646767dd5492366c3c9f2978faa022833b7af 1524046544940 1524046544000 1 disconnected
0441f7534aed16123bb3476124506251dab80747 10.135.64.55:7000@17000 slave 763646767dd5492366c3c9f2978faa022833b7af 0 1524046654010 1 connected
f63c210b13d68fa5dc97ca078af6d9c167f8c6ec 10.135.78.196:7001@17001 master,fail? - 1524046602511 1524046601582 2 disconnected 5461-10922
f90c932d5cf435c75697dc984b0cbb94c130f115 10.135.78.153:7001@17001 slave f63c210b13d68fa5dc97ca078af6d9c167f8c6ec 0 1524046655039 2 connected
00eb2402fc1868763a393ae2c9843c47cd7d49da 10.135.64.55:7001@17001 slave f63c210b13d68fa5dc97ca078af6d9c167f8c6ec 0 1524046656075 2 connected
19b8c9f7ac472ecfedd109e6bb7a4b932905c4fd 10.135.78.196:7002@17002 master,fail? - 1524046605581 1524046603746 4 disconnected 10923-16383
af75fc17e552279e5939bfe2df68075b3b6f9b29 10.135.78.153:7002@17002 slave 19b8c9f7ac472ecfedd109e6bb7a4b932905c4fd 0 1524046654623 4 connected
5f4bb09230ca016e7ffe2e6a4e5a32470175fb66 10.135.64.55:7002@17002 slave 19b8c9f7ac472ecfedd109e6bb7a4b932905c4fd 0 1524046654515 4 connected
Let me reiterate that – with a cross-replicated cluster you may lose the whole cluster after 2 consecutive reboots of single nodes. This is the reason why you’re better off with a dedicated node for each Redis instance; otherwise, with cross-replication, you should really watch the master distribution.
To avoid the situation above we should manually fail over one of the slaves on the third node to make it a master.
To do this we should connect to 10.135.64.55:7002, which is a replica now, and then issue the CLUSTER FAILOVER
command:
127.0.0.1:7002> CLUSTER FAILOVER
OK
127.0.0.1:7002> CLUSTER NODES
763646767dd5492366c3c9f2978faa022833b7af 10.135.78.153:7000@17000 master - 0 1524047703000 1 connected 0-5460
216a5ea51af1faed7fa42b0c153c91855f769321 10.135.78.196:7000@17000 slave 763646767dd5492366c3c9f2978faa022833b7af 0 1524047703512 1 connected
0441f7534aed16123bb3476124506251dab80747 10.135.64.55:7000@17000 slave 763646767dd5492366c3c9f2978faa022833b7af 0 1524047703512 1 connected
f63c210b13d68fa5dc97ca078af6d9c167f8c6ec 10.135.78.196:7001@17001 master - 0 1524047703000 2 connected 5461-10922
f90c932d5cf435c75697dc984b0cbb94c130f115 10.135.78.153:7001@17001 slave f63c210b13d68fa5dc97ca078af6d9c167f8c6ec 0 1524047703000 2 connected
00eb2402fc1868763a393ae2c9843c47cd7d49da 10.135.64.55:7001@17001 slave f63c210b13d68fa5dc97ca078af6d9c167f8c6ec 0 1524047703110 2 connected
5f4bb09230ca016e7ffe2e6a4e5a32470175fb66 10.135.64.55:7002@17002 myself,master - 0 1524047703000 5 connected 10923-16383
af75fc17e552279e5939bfe2df68075b3b6f9b29 10.135.78.153:7002@17002 slave 5f4bb09230ca016e7ffe2e6a4e5a32470175fb66 0 1524047702510 5 connected
19b8c9f7ac472ecfedd109e6bb7a4b932905c4fd 10.135.78.196:7002@17002 slave 5f4bb09230ca016e7ffe2e6a4e5a32470175fb66 0 1524047702009 5 connected
Now, suppose we’ve lost our third node completely and want to replace it with a brand new node.
$ redis-cli -c -p 7000 cluster nodes
763646767dd5492366c3c9f2978faa022833b7af 10.135.78.153:7000@17000 myself,master - 0 1524047906000 1 connected 0-5460
216a5ea51af1faed7fa42b0c153c91855f769321 10.135.78.196:7000@17000 slave 763646767dd5492366c3c9f2978faa022833b7af 0 1524047906811 1 connected
0441f7534aed16123bb3476124506251dab80747 10.135.64.55:7000@17000 slave,fail 763646767dd5492366c3c9f2978faa022833b7af 1524047871538 1524047869000 1 connected
f90c932d5cf435c75697dc984b0cbb94c130f115 10.135.78.153:7001@17001 slave f63c210b13d68fa5dc97ca078af6d9c167f8c6ec 0 1524047908000 2 connected
f63c210b13d68fa5dc97ca078af6d9c167f8c6ec 10.135.78.196:7001@17001 master - 0 1524047907318 2 connected 5461-10922
00eb2402fc1868763a393ae2c9843c47cd7d49da 10.135.64.55:7001@17001 slave,fail f63c210b13d68fa5dc97ca078af6d9c167f8c6ec 1524047872042 1524047869515 2 connected
19b8c9f7ac472ecfedd109e6bb7a4b932905c4fd 10.135.78.196:7002@17002 master - 0 1524047907000 6 connected 10923-16383
af75fc17e552279e5939bfe2df68075b3b6f9b29 10.135.78.153:7002@17002 slave 19b8c9f7ac472ecfedd109e6bb7a4b932905c4fd 0 1524047908336 6 connected
5f4bb09230ca016e7ffe2e6a4e5a32470175fb66 10.135.64.55:7002@17002 master,fail - 1524047871840 1524047869314 5 connected
First, we have to forget the lost node by issuing CLUSTER FORGET <node-id>
on every single node of the cluster (even slaves).
for id in 0441f7534aed16123bb3476124506251dab80747 00eb2402fc1868763a393ae2c9843c47cd7d49da 5f4bb09230ca016e7ffe2e6a4e5a32470175fb66; do
for port in 7000 7001 7002; do
redis-cli -c -p ${port} CLUSTER FORGET ${id}
done
done
Check that we’ve forgotten the failed node:
$ redis-cli -c -p 7000 cluster nodes
763646767dd5492366c3c9f2978faa022833b7af 10.135.78.153:7000@17000 myself,master - 0 1524048240000 1 connected 0-5460
216a5ea51af1faed7fa42b0c153c91855f769321 10.135.78.196:7000@17000 slave 763646767dd5492366c3c9f2978faa022833b7af 0 1524048241342 1 connected
f63c210b13d68fa5dc97ca078af6d9c167f8c6ec 10.135.78.196:7001@17001 master - 0 1524048240332 2 connected 5461-10922
f90c932d5cf435c75697dc984b0cbb94c130f115 10.135.78.153:7001@17001 slave f63c210b13d68fa5dc97ca078af6d9c167f8c6ec 0 1524048240000 2 connected
19b8c9f7ac472ecfedd109e6bb7a4b932905c4fd 10.135.78.196:7002@17002 master - 0 1524048241000 6 connected 10923-16383
af75fc17e552279e5939bfe2df68075b3b6f9b29 10.135.78.153:7002@17002 slave 19b8c9f7ac472ecfedd109e6bb7a4b932905c4fd 0 1524048241845 6 connected
Now spin up a new node, install Redis on it and launch 3 new instances with our cluster configuration.
These 3 new instances don’t know anything about the cluster:
[root@redis-replaced ~]# redis-cli -c -p 7000 cluster nodes
9a9c19e24e04df35ad54a8aff750475e707c8367 :7000@17000 myself,master - 0 0 0 connected
[root@redis-replaced ~]# redis-cli -c -p 7001 cluster nodes
3a35ebbb6160232d36984e7a5b97d430077e7eb0 :7001@17001 myself,master - 0 0 0 connected
[root@redis-replaced ~]# redis-cli -c -p 7002 cluster nodes
df701f8b24ae3c68ca6f9e1015d7362edccbb0ab :7002@17002 myself,master - 0 0 0 connected
so we have to add these Redis instances to the cluster:
$ redis-trib add-node --slave --master-id 763646767dd5492366c3c9f2978faa022833b7af 10.135.82.90:7000 10.135.78.153:7000
$ redis-trib add-node --slave --master-id f63c210b13d68fa5dc97ca078af6d9c167f8c6ec 10.135.82.90:7001 10.135.78.153:7000
$ redis-trib add-node --slave --master-id 19b8c9f7ac472ecfedd109e6bb7a4b932905c4fd 10.135.82.90:7002 10.135.78.153:7000
Now we should fail over the third shard:
[root@redis-replaced ~]# redis-cli -c -p 7002 cluster failover
OK
Aaaand, it’s done!
$ redis-cli -c -p 7000 cluster nodes
763646767dd5492366c3c9f2978faa022833b7af 10.135.78.153:7000@17000 myself,master - 0 1524049388000 1 connected 0-5460
f90c932d5cf435c75697dc984b0cbb94c130f115 10.135.78.153:7001@17001 slave f63c210b13d68fa5dc97ca078af6d9c167f8c6ec 0 1524049389000 2 connected
af75fc17e552279e5939bfe2df68075b3b6f9b29 10.135.78.153:7002@17002 slave df701f8b24ae3c68ca6f9e1015d7362edccbb0ab 0 1524049388000 7 connected
216a5ea51af1faed7fa42b0c153c91855f769321 10.135.78.196:7000@17000 slave 763646767dd5492366c3c9f2978faa022833b7af 0 1524049389579 1 connected
f63c210b13d68fa5dc97ca078af6d9c167f8c6ec 10.135.78.196:7001@17001 master - 0 1524049389579 2 connected 5461-10922
19b8c9f7ac472ecfedd109e6bb7a4b932905c4fd 10.135.78.196:7002@17002 slave df701f8b24ae3c68ca6f9e1015d7362edccbb0ab 0 1524049388565 7 connected
9a9c19e24e04df35ad54a8aff750475e707c8367 10.135.82.90:7000@17000 slave 763646767dd5492366c3c9f2978faa022833b7af 0 1524049389880 1 connected
3a35ebbb6160232d36984e7a5b97d430077e7eb0 10.135.82.90:7001@17001 slave f63c210b13d68fa5dc97ca078af6d9c167f8c6ec 0 1524049389579 2 connected
df701f8b24ae3c68ca6f9e1015d7362edccbb0ab 10.135.82.90:7002@17002 master - 0 1524049389579 7 connected 10923-16383
If you have to deal with bare metal servers, want a highly available Redis cluster and want to utilize your hardware effectively, building a cross-replicated Redis cluster topology is a good option.
This will work great, but there are 2 caveats: you have to distribute masters and replicas across the nodes by hand, and after failures you have to watch the master distribution (and rebalance with CLUSTER FAILOVER), otherwise losing one more node may take down the whole cluster.
I’m going to describe high availability in terms of node failure and not persistence.
Standalone Redis, which is the good old redis-server
you launch after
installation, is easy to set up and use, but it’s not resilient to the
failure of the node it’s running on. It doesn’t matter whether you use RDB or
AOF – as long as the node is unavailable, you are in trouble.
Over the years, the Redis community came up with a few high availability options – most of them are built into Redis itself, though there are some 3rd-party tools as well. Let’s dive into it.
Redis has had replication support since, like, forever and it works great –
just put the slaveof <addr> <port>
directive in your config file and the instance
will start receiving the stream of data from the master.
You can configure multiple slaves for the master, you can configure a slave of a slave, you can enable slave-only persistence, you can make replication synchronous (it’s async by default) – the list of what you can do with Redis replication seems bounded only by your imagination. Just read the docs for replication – they’re really great.
Pros:
Cons:
The lack of automatic failover is, IMHO, a major downside and that’s where Redis Sentinel helps.
Nobody wants to wake up in the middle of the night, just to issue the
SLAVEOF NO ONE
to elect a new master – it’s pretty silly and should be
automated, right? Right. That’s why Redis Sentinel exists.
Redis Sentinel is the tool that monitors Redis masters and slaves and automatically elects a new master from one of the slaves. It’s a really critical task, so you’re better off making Sentinel highly available itself. Luckily, it has built-in clustering which makes it a distributed system.
Sentinel is a quorum system, meaning that to agree on the new master there should be a majority of Sentinel nodes alive. This has a huge implication on how to deploy Sentinel. There are basically 2 options here – colocate with Redis server or deploy on a separate cluster. Colocating with Redis server makes sense because Sentinel is a very lightweight process, so why pay for additional nodes? But in this case, we lose our resilience because if you colocate Redis server and Sentinel on, say, 3 nodes, you can only lose 1 node because Sentinel needs 2 nodes to elect the new Redis server master. Without Sentinel, we could lose 2 slave nodes. So maybe you should think about a dedicated Sentinel cluster. If you’re on the cloud you could deploy it on some sort of nano instances but maybe it’s not your case. Tradeoffs, tradeoffs, I know.
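For reference, the Sentinel configuration itself is tiny – a minimal sketch (the master name, address and timeouts here are illustrative, not from my setup) looks like this:

port 26379
sentinel monitor mymaster 10.0.0.1 6379 2
sentinel down-after-milliseconds mymaster 5000
sentinel failover-timeout mymaster 60000
sentinel parallel-syncs mymaster 1

The "2" in the monitor line is the quorum – how many Sentinels must agree that the master is down before a failover is attempted.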
Besides dealing with maintaining one more distributed system, with Sentinel you should change the way your clients work with Redis, because now your master node can move. For this case, your application should first go to Sentinel, ask it about the current master and only then work with it. You can build a clever hack with HAProxy here – instead of going to Sentinel you can put HAProxy in front of the Redis servers to detect the new master with the help of TCP checks. See the example at the HAProxy blog.
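A minimal sketch of that HAProxy trick (server addresses are illustrative): the TCP check asks each Redis for INFO replication and only keeps the backend that reports role:master in rotation.

backend redis_master
    mode tcp
    option tcp-check
    tcp-check send PING\r\n
    tcp-check expect string +PONG
    tcp-check send info\ replication\r\n
    tcp-check expect string role:master
    tcp-check send QUIT\r\n
    tcp-check expect string +OK
    server redis1 10.0.0.1:6379 check inter 1s
    server redis2 10.0.0.2:6379 check inter 1s
    server redis3 10.0.0.3:6379 check inter 1s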
Nevertheless, Sentinel colocated with Redis servers is a really common solution for Redis high availability, for example, Gitlab recommends it in its admin guide.
Pros:
Cons:
All of the solutions above seem, IMHO, half-assed because they add more things and these things are not obvious, at least at first sight. I don’t know any other system that solves the availability problem by adding yet another cluster that must be available itself. It’s just annoying.
So with recent versions of Redis came the Cluster – a builtin feature that adds sharding, replication and high availability to the known and loved Redis. Within a cluster, you have multiple master instances, each serving a subset of the keyspace. Clients may send requests to any of the master instances, which will redirect them to the correct instance for the given key. Master instances may have as many replicas as they want, and these replicas will be promoted to master automatically even without a quorum. Note, though, that a quorum of master instances is required for the whole cluster to work, but a quorum is not required for an individual shard to work, including the election of a new master.
Each instance in the Redis cluster (master or slave) should be deployed on a dedicated node but you can configure cross replication where each node will contain multiple instances. There are sharp corners here, though, that I’ll illustrate in the next post, so stay tuned!
Pros:
Cons:
Twemproxy is a special proxy for in-memory databases – namely, memcached and Redis – that was built by Twitter. It adds sharding with consistent hashing, so resharding is not that painful, and also maintains persistent connections and enables requests/response pipelining.
I haven’t tried it because in the era of Redis cluster it doesn’t seem relevant to me anymore, so I couldn’t tell pros and cons, but YMMV.
After the initial post, quite a few people reached out to me saying that they have great success with Redis Enterprise from Redis Labs. Check out this one from Reddit. The point is that if you have a really high workload, your data is critical and you can afford it, then you should consider their solution.
You may also check their guide on Redis High Availability – it is also well written and illustrated.
Choosing the right solution for Redis high availability is full of tradeoffs. Nobody knows your situation better than you, so get to know how Redis works – there is no magic here – in the end, you’ll have to maintain the solution. In my case, we have chosen a Redis cluster with cross replication after lots of testing and writing a doc with instructions on how to deal with failures.
That’s all for now, stay tuned for the dedicated Redis cluster post!
It’s all nice and dandy but after creating an instance from some basic AMI I
need to provision it. My go-to tool for this is Ansible but, unfortunately,
Terraform doesn’t support it natively as it does for Chef and Salt. This is
unlike Packer that has
ansible
(remote) and ansible-local
that I’ve used for creating a Docker
image.
So I’ve spent some time and found a few ways to marry Terraform with Ansible that I’ll describe hereafter. But first, let’s talk about provisioning.
Instead of using an empty base AMI you could bake your own AMI and skip the whole provisioning part completely, but I see a giant flaw in this setup. Every change, even a small one, requires recreation of the whole instance. If it’s a change somewhere on the base level then you’ll need to recreate your whole fleet. It quickly becomes unusable for deployments, security patching, adding/removing a user, changing a config and other simple things.
Even more so, if you bake your own AMIs you still have to provision them somehow, and that’s where things like Ansible appear again. My recommendation here is again to use Packer with Ansible.
So in most cases, I’m strongly for provisioning because it’s unavoidable anyway.
Now, returning to the actual provisioning, I found 3 ways to use Ansible with Terraform after reading the heated discussion at this GitHub issue (https://github.com/hashicorp/terraform/issues/2661). Read on to find the one that’s most suitable for you.
One of the most obvious yet hacky solutions is to invoke Ansible within the
local-exec
provisioner. Here is how it looks:
provisioner "local-exec" {
command = "ansible-playbook -i '${self.public_ip},' --private-key ${var.ssh_key_private} provision.yml"
}
Nice and simple, but there is a problem here. The local-exec
provisioner starts
without waiting for the instance to launch, so in most cases it will fail
because by the time it tries to connect there is nobody listening.
As a nice workaround, you can use a preliminary remote-exec
provisioner that
will wait until the connection to the instance is established and only then invoke the
local-exec
provisioner.
As a result, I have this thingy that plays the role of an “Ansible provisioner”:
provisioner "remote-exec" {
inline = ["sudo dnf -y install python"]
connection {
type = "ssh"
user = "fedora"
private_key = "${file(var.ssh_key_private)}"
}
}
provisioner "local-exec" {
command = "ansible-playbook -u fedora -i '${self.public_ip},' --private-key ${var.ssh_key_private} provision.yml"
}
To make ansible-playbook
work you have to have the Ansible code in the same
directory as the Terraform code, like this:
$ ll infra
drwxrwxr-x. 3 avd avd 4.0K Mar 5 15:54 roles/
-rw-rw-r--. 1 avd avd 367 Mar 5 15:19 ansible.cfg
-rw-rw-r--. 1 avd avd 2.5K Mar 7 18:54 main.tf
-rw-rw-r--. 1 avd avd 454 Mar 5 15:27 variables.tf
-rw-rw-r--. 1 avd avd 38 Mar 5 15:54 provision.yml
This inline inventory will work in most cases, except when you need multiple hosts in the inventory. For example, when you set up a Consul agent you need a list of Consul servers to render its config, and that list usually comes from the regular inventory. But it won’t work here because you have a single host in your inventory.
Anyway, I’m using this approach for the basic things like setting up users and installing some basic packages.
Another simple solution for provisioning infrastructure created by Terraform is to simply not tie Terraform and Ansible together. Create infrastructure with Terraform and then use Ansible with a dynamic inventory, regardless of how your instances were created.
So you first create an infra with terraform apply
and then you invoke
ansible-playbook -i inventory site.yml
, where inventory
dir contains
dynamic inventory scripts.
This will work great but has a little drawback – if you need to increase the number of instances you must remember to launch Ansible after Terraform.
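If forgetting the second step is a concern, a tiny wrapper script can keep the two steps glued together – a minimal sketch, assuming the site.yml playbook and inventory directory mentioned above (the file name deploy.sh is made up):

#!/bin/sh
# deploy.sh - apply infrastructure changes, then immediately re-run provisioning
set -e
terraform apply
ansible-playbook -i inventory site.yml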
That’s what I use as a complement to the previous approach.
There is another interesting thing that might work for you – generate static inventory from Terraform state.
When you work with Terraform it maintains the state of the infrastructure that contains everything including your instances. With a local backend, this state is stored in a JSON file that can be easily parsed and converted to the Ansible inventory.
Here are 2 projects with examples that you can use if you want to go this way.
https://github.com/adammck/terraform-inventory
$ terraform-inventory -inventory terraform.tfstate
[all]
52.51.215.84
[all:vars]
[server]
52.51.215.84
[server.0]
52.51.215.84
[type_aws_instance]
52.51.215.84
[name_c10k server]
52.51.215.84
[%_1]
52.51.215.84
https://github.com/express42/terraform-ansible-example/blob/master/ansible/terraform.py
$ ~/soft/terraform.py --root . --hostfile
## begin hosts generated by terraform.py ##
52.51.215.84 C10K Server
## end hosts generated by terraform.py ##
IMHO, I don’t see a point in this approach.
Finally, there are a few projects that try to make a native-looking Ansible provisioner for Terraform, like the builtin Chef provisioner.
https://github.com/jonmorehouse/terraform-provisioner-ansible – this was the first attempt to make such a plugin but, unfortunately, it’s not currently maintained and moreover it’s not supported by the current Terraform plugin system.
https://github.com/radekg/terraform-provisioner-ansible – this one is more recent and currently maintained. It enables this kind of provisioning:
...
provisioner "ansible" {
plays {
playbook = "./provision.yml"
hosts = ["${self.public_ip}"]
}
become = "yes"
local = "yes"
}
...
Unfortunately, I wasn’t able to make it work, so I blew it off because the first 2 solutions cover all of my cases.
Terraform and Ansible are a powerful combo that I use for provisioning cloud
infrastructure. For basic cloud instance setup, I invoke Ansible with
local-exec
and later I invoke Ansible separately with a dynamic inventory.
You can find an example of how I do it at c10k/infrastructure
Thanks! Until next time!
At first, you may wonder why we should instrument our code at all – why not collect the metrics needed for monitoring from the outside, e.g. just install a Zabbix agent or set up Nagios checks? There is nothing really wrong with that solution where you treat monitoring targets as black boxes. Though there is another way to do it – white-box monitoring – where your services provide metrics themselves as a result of instrumentation. It’s not really about choosing only one way of doing things – both of these solutions may, and should, supplement each other. For example, you may treat your database servers as a black box providing metrics such as available memory, while instrumenting your database access layer to measure DB request latency.
It’s all about different points of view and it was discussed in Google’s SRE book:
The simplest way to think about black-box monitoring versus white-box monitoring is that black-box monitoring is symptom-oriented and represents active—not predicted—problems: “The system isn’t working correctly, right now.” White-box monitoring depends on the ability to inspect the innards of the system, such as logs or HTTP endpoints, with instrumentation. White-box monitoring, therefore, allows detection of imminent problems, failures masked by retries, and so forth. … When collecting telemetry for debugging, white-box monitoring is essential. If web servers seem slow on database-heavy requests, you need to know both how fast the web server perceives the database to be, and how fast the database believes itself to be. Otherwise, you can’t distinguish an actually slow database server from a network problem between your web server and your database.
My point is that to gain real observability of your system you should supplement your existing black-box monitoring with white-box monitoring by instrumenting your services.
Now that we’re convinced that instrumenting is a good thing, let’s think about what to monitor. A lot of people say that you should instrument everything you can, but I think that’s over-engineering – you should instrument the things that really matter, to avoid codebase complexity and unnecessary CPU cycles spent in your service on collecting a bloat of metrics.
So what are those things that really matter that we should instrument for? Well, the same SRE book defines the so-called four golden signals of monitoring: latency, traffic, errors and saturation.
Out of these 4 signals, saturation is the most confusing because it’s not clear how to measure it or if it’s even possible in a software system. I see saturation mostly for hardware resources, which I’m not going to cover here – check Brendan Gregg’s USE method for that.
Because saturation is hard to measure in a software system, there is a service-tailored version of the 4 golden signals which is called “the RED method”, which lists 3 metrics: Rate (requests per second), Errors (the number of failed requests) and Duration (the time requests take).
That’s what we’ll instrument for in the webkv
service.
We will use Prometheus to monitor our service because it’s the go-to tool for monitoring these days – it’s simple, easy to set up and fast. We will need the Prometheus Go client library for instrumenting our code.
Prometheus works by pulling data from a /metrics
HTTP handler that serves metrics in a simple text-based exposition format, so we need to calculate the RED metrics and export them via a dedicated endpoint.
Luckily, all of these metrics can be easily exported with an InstrumentHandler
helper.
diff --git a/webkv.go b/webkv.go
index 94bd025..f43534f 100644
--- a/webkv.go
+++ b/webkv.go
@@ -9,6 +9,7 @@ import (
"strings"
"time"
+ "github.com/prometheus/client_golang/prometheus"
"github.com/prometheus/client_golang/prometheus/promhttp"
"github.com/alexdzyoba/webkv/service"
@@ -32,7 +33,7 @@ func main() {
if err != nil {
log.Fatal(err)
}
- http.Handle("/", s)
+ http.Handle("/", prometheus.InstrumentHandler("webkv", s))
http.Handle("/metrics", promhttp.Handler())
l := fmt.Sprintf(":%d", *port)
and now to export the metrics via /metrics
endpoint just add another 2 lines:
diff --git a/webkv.go b/webkv.go
index 1b2a9d7..94bd025 100644
--- a/webkv.go
+++ b/webkv.go
@@ -9,6 +9,8 @@ import (
"strings"
"time"
+ "github.com/prometheus/client_golang/prometheus/promhttp"
+
"github.com/alexdzyoba/webkv/service"
)
@@ -31,6 +33,7 @@ func main() {
log.Fatal(err)
}
http.Handle("/", s)
+ http.Handle("/metrics", promhttp.Handler())
l := fmt.Sprintf(":%d", *port)
log.Print("Listening on ", l)
And that’s it!
No, seriously, that’s all you need to do to make your service observable. It’s so nice and easy that you don’t have excuses for not doing it.
InstrumentHandler
conveniently wraps your handler and exports the following metrics:
http_request_duration_microseconds
summary with 50, 90 and 99 percentileshttp_request_size_bytes
summary with 50, 90 and 99 percentileshttp_response_size_bytes
summary with 50, 90 and 99 percentileshttp_requests_total
counter labeled by status code and handlerpromhttp.Handler
also exports Go runtime information like a number of goroutines and memory stats.
The point is that you export simple metrics that you can easily calculate on the service and everything else is done with Prometheus and its powerful query language PromQL.
Now you need to tell Prometheus about your services so it will start scraping them. We could’ve hardcoded our endpoint with static_configs
pointing it to ’localhost:8080’. But remember how we previously registered our service in Consul? Prometheus can discover targets for scraping from Consul for our service and any other services with a single job definition:
- job_name: 'consul'
consul_sd_configs:
- server: 'localhost:8500'
relabel_configs:
- source_labels: [__meta_consul_service]
target_label: job
That’s the pure awesomeness of Service Discovery! Your ops buddy will thank you for that :-)
(relabel_configs
is needed because otherwise all services would be scraped as
consul
)
Check that Prometheus recognized new targets:
Yay!
Now let’s calculate the metrics for the RED method. The first one is the request rate, and it can be calculated from the http_requests_total
metric like this:
rate(http_requests_total{job="webkv",code=~"^2.*"}[1m])
We filter the HTTP request counter for the webkv
job and successful HTTP status codes, get a vector of values for the last 1 minute and then take a rate, which is basically a diff between the first and last values. This gives us the rate of requests that were successfully handled over the last minute. Because the counter is accumulating, we’ll never miss values even if some scrape failed.
The second one is the errors that we can calculate from the same metric as a rate but what we actually want is a percentage of errors. This is how I calculate it:
sum(rate(http_requests_total{job="webkv",code!~"^2.*"}[1m])) / sum(rate(http_requests_total{job="webkv"}[1m])) * 100
In this error query, we take the rate of error requests, that is, the ones with a non-2xx status code. This will give us multiple series, one for each status code like 404 or 500, so we need to sum
them. Next, we do the same sum
and rate
but for all of the requests regardless of their status to get the overall request rate. And finally, we divide and multiply by 100 to get a percentage.
Finally, the latency distribution lies directly in http_request_duration_microseconds
metric:
http_request_duration_microseconds{job="webkv"}
So that was easy and it’s more than enough for my simple service.
If you want to instrument for some custom metrics you can do it easily. I’ll show you how to do the same for the Redis requests that are made from the webkv
handler. It’s not of much use because there is a dedicated Redis exporter for Prometheus but, anyway, it’s just for illustration.
As you can see from the previous sections, all we need to get meaningful monitoring is just 2 metrics – a plain counter for HTTP requests partitioned by status code and a summary for request durations.
Let’s start with the counter. First, to make things nice, we define a new type Metrics
with Prometheus CounterVec
and add it to the Service
struct:
--- a/service/service.go
+++ b/service/service.go
@@ -13,6 +14,7 @@ type Service struct {
Port int
RedisClient redis.UniversalClient
ConsulAgent *consul.Agent
+ Metrics Metrics
}
+
+type Metrics struct {
+ RedisRequests *prometheus.CounterVec
+}
+
Next, we must register our metric:
--- a/service/service.go
+++ b/service/service.go
@@ -28,6 +30,15 @@ func New(addrs []string, ttl time.Duration, port int) (*Service, error) {
Addrs: addrs,
})
+ s.Metrics.RedisRequests = prometheus.NewCounterVec(
+ prometheus.CounterOpts{
+ Name: "redis_requests_total",
+ Help: "How many Redis requests processed, partitioned by status",
+ },
+ []string{"status"},
+ )
+ prometheus.MustRegister(s.Metrics.RedisRequests)
+
ok, err := s.Check()
if !ok {
return nil, err
We have created a variable of CounterVec
type because plain Counter
is for a single time series and we have a label for status, which makes it a vector of time series.
Finally, we need to increment the counter depending on the status:
--- a/service/redis.go
+++ b/service/redis.go
@@ -15,7 +15,9 @@ func (s *Service) ServeHTTP(w http.ResponseWriter, r *http.Request) {
if err != nil {
http.Error(w, "Key not found", http.StatusNotFound)
status = 404
+ s.Metrics.RedisRequests.WithLabelValues("fail").Inc()
}
+ s.Metrics.RedisRequests.WithLabelValues("success").Inc()
fmt.Fprint(w, val)
log.Printf("url=\"%s\" remote=\"%s\" key=\"%s\" status=%d\n",
Check that it’s working:
$ curl -s 'localhost:8080/metrics' | grep redis
# HELP redis_requests_total How many Redis requests processed, partitioned by status
# TYPE redis_requests_total counter
redis_requests_total{status="fail"} 904
redis_requests_total{status="success"} 5433
Nice!
Calculating latency distribution is a little bit more involved because we have
to time our requests and put them into distribution buckets. Fortunately, there is a very nice prometheus.Timer
helper to help measure time. As for the distribution buckets, Prometheus has a Summary
type that does it automatically.
Ok, so first we have to register our new metric (adding it to our Metrics
type):
--- a/service/service.go
+++ b/service/service.go
@@ -18,7 +18,8 @@ type Service struct {
}
type Metrics struct {
RedisRequests *prometheus.CounterVec
+ RedisDurations prometheus.Summary
}
func New(addrs []string, ttl time.Duration, port int) (*Service, error) {
@@ -39,6 +40,14 @@ func New(addrs []string, ttl time.Duration, port int) (*Service, error) {
)
prometheus.MustRegister(s.Metrics.RedisRequests)
+ s.Metrics.RedisDurations = prometheus.NewSummary(
+ prometheus.SummaryOpts{
+ Name: "redis_request_durations",
+ Help: "Redis requests latencies in seconds",
+ Objectives: map[float64]float64{0.5: 0.05, 0.9: 0.01, 0.99: 0.001},
+ })
+ prometheus.MustRegister(s.Metrics.RedisDurations)
+
ok, err := s.Check()
if !ok {
return nil, err
Our new metric is just a Summary
, not a SummaryVec
because we have no labels. We defined 3 “objectives” – basically 3 buckets for calculating the distribution – the 50th, 90th and 99th percentiles.
Here is how we measure request latency:
--- a/service/redis.go
+++ b/service/redis.go
@@ -5,12 +5,18 @@ import (
"log"
"net/http"
"strings"
+
+ "github.com/prometheus/client_golang/prometheus"
)
func (s *Service) ServeHTTP(w http.ResponseWriter, r *http.Request) {
status := 200
key := strings.Trim(r.URL.Path, "/")
+
+ timer := prometheus.NewTimer(s.Metrics.RedisDurations)
+ defer timer.ObserveDuration()
+
val, err := s.RedisClient.Get(key).Result()
if err != nil {
http.Error(w, "Key not found", http.StatusNotFound)
status = 404
s.Metrics.RedisRequests.WithLabelValues("fail").Inc()
}
s.Metrics.RedisRequests.WithLabelValues("success").Inc()
fmt.Fprint(w, val)
log.Printf("url=\"%s\" remote=\"%s\" key=\"%s\" status=%d\n",
r.URL, r.RemoteAddr, key, status)
}
Yep, it’s that easy. You just create a new timer and defer its invocation so it will be invoked on function exit. Although the measurement will additionally include the logging call, I’m okay with that.
By default, this timer measures time in seconds. To mimic http_request_duration_microseconds
we can implement the Observer
interface that NewTimer
accepts so it does the calculation our way:
--- a/service/redis.go
+++ b/service/redis.go
@@ -14,7 +14,10 @@ func (s *Service) ServeHTTP(w http.ResponseWriter, r *http.Request) {
key := strings.Trim(r.URL.Path, "/")
- timer := prometheus.NewTimer(s.Metrics.RedisDurations)
+ timer := prometheus.NewTimer(prometheus.ObserverFunc(func(v float64) {
+ us := v * 1000000 // make microseconds
+ s.Metrics.RedisDurations.Observe(us)
+ }))
defer timer.ObserveDuration()
val, err := s.RedisClient.Get(key).Result()
--- a/service/service.go
+++ b/service/service.go
@@ -43,7 +43,7 @@ func New(addrs []string, ttl time.Duration, port int) (*Service, error) {
s.Metrics.RedisDurations = prometheus.NewSummary(
prometheus.SummaryOpts{
Name: "redis_request_durations",
- Help: "Redis requests latencies in seconds",
+ Help: "Redis requests latencies in microseconds",
Objectives: map[float64]float64{0.5: 0.05, 0.9: 0.01, 0.99: 0.001},
})
prometheus.MustRegister(s.Metrics.RedisDurations)
That’s it!
$ curl -s 'localhost:8080/metrics' | grep -P '(redis.*durations)'
# HELP redis_request_durations Redis requests latencies in microseconds
# TYPE redis_request_durations summary
redis_request_durations{quantile="0.5"} 207.17399999999998
redis_request_durations{quantile="0.9"} 230.399
redis_request_durations{quantile="0.99"} 298.585
redis_request_durations_sum 3.290851703000006e+06
redis_request_durations_count 15728
And now that we have beautiful metrics, let’s make a dashboard for them!
It’s no secret that once you have Prometheus, you will eventually have Grafana to show dashboards for your metrics, because Grafana has builtin support for Prometheus as a data source.
In my dashboard, I’ve just put our RED metrics and sprinkled some colors. Here is the final dashboard:
Note that for the latency graph, I’ve created 3 series, one for each of the 0.5, 0.9 and 0.99 quantiles, and divided them by 1000 to get millisecond values.
There is no magic here – monitoring the four golden signals or the RED metrics is easy with modern tools like Prometheus and Grafana, and you really need it because without it you’re flying blind. So the next time you develop a service, just add some instrumentation – be nice and cultivate at least some operational sympathy for the greater good.
Let’s start with a common Python stanza of
if __name__ == '__main__':
invoke_the_real_code()
A lot of people, and I’m not an exception, write it as a ritual without trying to understand it. We somewhat know that this snippet makes a difference when you invoke your code from the CLI versus importing it. But let’s try to understand why we really need it.
For illustration, assume that we’re writing some pizza shop software. It’s on
Github. Here is the pizza.py
file.
# pizza.py file
import math
class Pizza:
name: str = ''
size: int = 0
price: float = 0
def __init__(self, name: str, size: int, price: float) -> None:
self.name = name
self.size = size
self.price = price
def area(self) -> float:
return math.pi * math.pow(self.size / 2, 2)
def awesomeness(self) -> int:
if self.name == 'Carbonara':
return 9000
return self.size // int(self.price) * 100
print('pizza.py module name is %s' % __name__)
if __name__ == '__main__':
print('Carbonara is the most awesome pizza.')
I’ve added printing of the magical __name__
variable to see how it may change.
OK, first, let’s run it as a script:
$ python3 pizza.py
pizza.py module name is __main__
Carbonara is the most awesome pizza.
Indeed, the __name__
global variable is set to __main__
when we invoke
it from the CLI.
But what if we import it from another file? Here is the menu.py
source
code:
# menu.py file
from typing import List
from pizza import Pizza
MENU: List[Pizza] = [
Pizza('Margherita', 30, 10.0),
Pizza('Carbonara', 45, 14.99),
Pizza('Marinara', 35, 16.99),
]
if __name__ == '__main__':
print(MENU)
Run menu.py
$ python3 menu.py
pizza.py module name is pizza
[<pizza.Pizza object at 0x7fbbc1045470>, <pizza.Pizza object at 0x7fbbc10454e0>, <pizza.Pizza object at 0x7fbbc1045b38>]
And now we see 2 things:
1. The print statement from pizza.py was executed on import.
2. __name__ in pizza.py is now set to the filename without the .py suffix.
So, the thing is, __name__ is the global variable that holds the name of the current Python module.
When a module is run directly, its __name__ variable is set to __main__.
So what is the module, after all? It’s really simple – a module is a file
containing Python code that you can execute with the interpreter (the python
program) or import from other modules.
Just like when executing, when the module is being imported, its top-level statements are executed, but be aware that it’ll be executed only once even if you import it several times even from different files.
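A quick way to see the “only once” behaviour is to import the module twice – the second import is served from the module cache, so the top-level print doesn’t fire again (a tiny sketch, assuming pizza.py from above is importable):

# demo.py
import pizza   # prints: pizza.py module name is pizza
import pizza   # no output - the module was already imported and cached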
Because modules are just plain files, there is a simple way to import them. Just
take the filename, remove the .py
extension and put it in the import
statement.
Modules are imported by file name, without the .py extension.
What is interesting is that __name__ is set to the filename regardless of how you import it – with import pizza as broccoli, __name__ will still be pizza. So __name__ holds the module’s file name without the .py extension, even if it’s renamed with import module as othername.
But what if the module that we import is not located in the same directory, how can we import it? The answer is in the module search path that we’ll eventually discover while discussing packages.
A package is basically a namespace for a bunch of modules. The namespace part is important because by itself a package doesn’t provide any functionality – it only gives you a way to group your modules.
There are 2 cases where you really want to put modules into a package. The first is
to isolate definitions of one module from the others. In our pizza
module, we
have a Pizza
class that might conflict with others’ Pizza classes (and we do
have some pizza packages on PyPI).
The second case is if you want to distribute your code, because everything that you see on PyPI and install via pip is a package, so in order to share your awesome stuff, you have to make a package out of it.
Alright, assume we’re convinced and want to convert our 2 modules into a nice
package. To do this we need to create a directory with an empty __init__.py
file
and move our files into it:
pizzapy/
├── __init__.py
├── menu.py
└── pizza.py
And that’s it – now you have a pizzapy
package!
A package is a directory of modules with an __init__.py file.
Remember that a package is a namespace for modules, so you don’t import the package itself, you import a module from a package.
>>> import pizzapy.menu
pizza.py module name is pizza
>>> pizzapy.menu.MENU
[<pizza.Pizza object at 0x7fa065291160>, <pizza.Pizza object at 0x7fa065291198>, <pizza.Pizza object at 0x7fa065291a20>]
If you do the import that way, it may seem too verbose because you need to use the fully qualified name. I guess that’s intentional behavior because one of the Python Zen items is “explicit is better than implicit”.
Anyway, you can always use a from package import module
form to shorten names:
>>> from pizzapy import menu
pizza.py module name is pizza
>>> menu.MENU
[<pizza.Pizza object at 0x7fa065291160>, <pizza.Pizza object at 0x7fa065291198>, <pizza.Pizza object at 0x7fa065291a20>]
Remember how we put an __init__.py
file in a directory and it magically became
a package? That’s a great example of convention over configuration – we don’t
need to describe any configuration or register anything. Any directory with an
__init__.py
is, by convention, a Python package.
Besides making a package __init__.py
conveys one more purpose – package
initialization. That’s why it’s called init after all! Initialization is
triggered on the package import, in other words importing a package invokes
__init__.py
On import, the __init__.py module of the package is executed.
In the __init__
module you can do anything you want, but most commonly it’s
used for some package initialization or setting the special __all__
variable.
The latter controls star import – from package import *
.
And because Python is awesome we can do pretty much anything in the __init__
module, even really strange things. Suppose we don’t like the explicitness of
import and want to drag all of the modules’ symbols up to the package level, so
we don’t have to remember the actual module names.
To do that we can import everything from menu
and pizza
modules in
__init__.py
like this
# pizzapy/__init__.py
from pizzapy.pizza import *
from pizzapy.menu import *
See:
>>> import pizzapy
pizza.py module name is pizzapy.pizza
pizza.py module name is pizza
>>> pizzapy.MENU
[<pizza.Pizza object at 0x7f1bf03b8828>, <pizza.Pizza object at 0x7f1bf03b8860>, <pizza.Pizza object at 0x7f1bf03b8908>]
No more pizzapy.menu.MENU
or menu.MENU
:-) That way it kinda works like
packages in Go, but note that this is discouraged because you are trying to
abuse Python, and if you check in such code you’re gonna have a bad time
at code review. I’m showing you this just for illustration, don’t blame me!
You could rewrite the import more succinctly like this
# pizzapy/__init__.py
from .pizza import *
from .menu import *
This is just another syntax for doing the same thing, which is called relative imports. Let’s look at them closer.
The 2 code pieces above are the only way of doing a so-called relative import,
because since Python 3 all imports are absolute by default (as in
PEP 328), meaning that
an import will try to import standard modules first and only then local packages.
This is needed to avoid shadowing of standard modules: if you create your own
sys.py
module, doing import sys
could override the standard library sys
module.
But if your package has a module called sys
and you want to import it into
another module of the same package you have to make a relative import. To do
it you have to be explicit again and write from package.module import somesymbol
or from .module import somesymbol
. That funny single dot before
module name is read as “current package”.
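For example, inside the pizzapy package the menu module could import the Pizza class relatively instead of relying on the top-level pizza module being on the path – a one-line sketch:

# pizzapy/menu.py
from .pizza import Pizza  # "." means "the current package", i.e. pizzapy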
In Python you can invoke a module with a python3 -m <module>
construction.
$ python3 -m pizza
pizza.py module name is __main__
Carbonara is the most awesome pizza.
But packages can also be invoked this way:
$ python3 -m pizzapy
/usr/bin/python3: No module named pizzapy.__main__; 'pizzapy' is a package and cannot be directly executed
As you can see, it needs a __main__
module, so let’s implement it:
# pizzapy/__main__.py
from pizzapy.menu import MENU
print('Awesomeness of pizzas:')
for pizza in MENU:
print(pizza.name, pizza.awesomeness())
And now it works:
$ python3 -m pizzapy
pizza.py module name is pizza
Awesomeness of pizzas:
Margherita 300
Carbonara 9000
Marinara 200
__main__.py makes a package executable (invoke it with python3 -m package).
And the last thing I want to cover is the import of sibling packages. Suppose we
have a sibling package pizzashop
:
.
├── pizzapy
│ ├── __init__.py
│ ├── __main__.py
│ ├── menu.py
│ └── pizza.py
└── pizzashop
├── __init__.py
└── shop.py
# pizzashop/shop.py
import pizzapy.menu
print(pizzapy.menu.MENU)
Now, sitting in the top level directory, if we try to invoke shop.py like this
$ python3 pizzashop/shop.py
Traceback (most recent call last):
File "pizzashop/shop.py", line 1, in <module>
import pizzapy.menu
ModuleNotFoundError: No module named 'pizzapy'
we get an error that our pizzapy module is not found. But if we invoke it as a part of the package
$ python3 -m pizzashop.shop
pizza.py module name is pizza
[<pizza.Pizza object at 0x7f372b59ccc0>, <pizza.Pizza object at 0x7f372b59ccf8>, <pizza.Pizza object at 0x7f372b59cda0>]
it suddenly works. What the hell is going on here?
The explanation for this lies in the Python module search path, and it's greatly described in the documentation on modules.
The module search path is a list of directories (available at runtime as sys.path)
that the interpreter uses to locate modules. It is initialized with the path to
Python standard modules (/usr/lib64/python3.6
), site-packages
where pip
puts
everything you install globally, and also a directory that depends on how you
run a module. If you run a module as a file like python3 pizzashop/shop.py
the
path to containing directory (pizzashop
) is added to sys.path
. Otherwise,
including running with -m
option, the current directory (as in pwd
) is added
to module search path. We can check it by printing sys.path
in
pizzashop/shop.py
:
$ pwd
/home/avd/dev/python-imports
$ tree
.
├── pizzapy
│ ├── __init__.py
│ ├── __main__.py
│ ├── menu.py
│ └── pizza.py
└── pizzashop
├── __init__.py
└── shop.py
$ python3 pizzashop/shop.py
['/home/avd/dev/python-imports/pizzashop',
'/usr/lib64/python36.zip',
'/usr/lib64/python3.6',
'/usr/lib64/python3.6/lib-dynload',
'/usr/local/lib64/python3.6/site-packages',
'/usr/local/lib/python3.6/site-packages',
'/usr/lib64/python3.6/site-packages',
'/usr/lib/python3.6/site-packages']
Traceback (most recent call last):
File "pizzashop/shop.py", line 5, in <module>
import pizzapy.menu
ModuleNotFoundError: No module named 'pizzapy'
$ python3 -m pizzashop.shop
['',
'/usr/lib64/python36.zip',
'/usr/lib64/python3.6',
'/usr/lib64/python3.6/lib-dynload',
'/usr/local/lib64/python3.6/site-packages',
'/usr/local/lib/python3.6/site-packages',
'/usr/lib64/python3.6/site-packages',
'/usr/lib/python3.6/site-packages']
pizza.py module name is pizza
[<pizza.Pizza object at 0x7f2f75747f28>, <pizza.Pizza object at 0x7f2f75747f60>, <pizza.Pizza object at 0x7f2f75747fd0>]
As you can see in the first case we have the pizzashop
dir in our path and so
we cannot find sibling pizzapy
package, while in the second case the current
dir (denoted as ''
) is in sys.path
and it contains both packages.
If you run a module as a file, the path to its containing directory is added to sys.path; otherwise, the current directory is added to it.
This problem of importing a sibling package often arises when people put a bunch of test or example scripts in a directory or package next to the main package. There are a couple of StackOverflow questions about exactly this.
The good solution is to avoid the problem – put tests or examples in the
package itself and use relative import. The dirty solution is to modify
sys.path
at runtime (yay, dynamic!) by adding the parent directory of the
needed package. People actually do this even though it's an awful hack.
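For completeness, here is a minimal sketch of that sys.path hack (don't actually do this):
# pizzashop/shop.py – the dirty workaround, shown only as a sketch
import os
import sys

# Add the parent directory (the one containing both pizzapy and pizzashop)
# to the module search path before importing the sibling package.
sys.path.insert(0, os.path.dirname(os.path.dirname(os.path.abspath(__file__))))

import pizzapy.menu
print(pizzapy.menu.MENU)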
I hope that after reading this post you’ll have a better understanding of Python imports and could finally decompose that giant script you have in your toolbox without fear. In the end, everything in Python is really simple and even when it is not sufficient to your case, you can always monkey patch anything at runtime.
And on that note, I would like to stop and thank you for your attention. Until next time!
Looking at git diff, I thought "How does it
work?". The brute-force idea of comparing all possible pairs of lines doesn't seem
efficient and indeed it has exponential algorithmic complexity. There must be
a better way, right?
As it turned out, git diff
, like a usual diff
tool is modeled as a solution
to a problem called Longest Common Subsequence. The idea is really ingenious –
when we try to diff 2 files we see it as 2 sequences of lines and try to find a
Longest Common Subsequence. Then anything that is not in that subsequence is our
diff. Sounds neat, but how can one implement it in an efficient way (without that
exponential complexity)?
The LCS problem is a classic problem that is best solved with dynamic programming – a somewhat advanced technique in algorithm design that roughly means iteration with memoization.
I’ve always struggled with dynamic programming because it’s mostly presented through some (in my opinion) artificial problem that is hard for me to work on. But now, when I see something so useful that can help me write a diff, I just can’t resist.
I used a Wikipedia article on LCS as my guide, so if you want to check the algorithm nitty-gritty, go ahead to the link. I’m going to show you my implementation (that is, of course, available on GitHub) to demonstrate how easily you can solve such seemingly hard problem.
I’ve chosen Python to implement it and immediately felt grateful because you can copy-paste pseudocode and use it with minimal changes. Here is the diff printing function from Wikipedia article in pseudocode:
function printDiff(C[0..m,0..n], X[1..m], Y[1..n], i, j)
if i > 0 and j > 0 and X[i] = Y[j]
printDiff(C, X, Y, i-1, j-1)
print " " + X[i]
else if j > 0 and (i = 0 or C[i,j-1] ≥ C[i-1,j])
printDiff(C, X, Y, i, j-1)
print "+ " + Y[j]
else if i > 0 and (j = 0 or C[i,j-1] < C[i-1,j])
printDiff(C, X, Y, i-1, j)
print "- " + X[i]
else
print ""
And in Python:
def print_diff(c, x, y, i, j):
"""Print the diff using LCS length matrix by backtracking it"""
if i >= 0 and j >= 0 and x[i] == y[j]:
print_diff(c, x, y, i-1, j-1)
print(" " + x[i])
elif j >= 0 and (i == 0 or c[i][j-1] >= c[i-1][j]):
print_diff(c, x, y, i, j-1)
print("+ " + y[j])
elif i >= 0 and (j == 0 or c[i][j-1] < c[i-1][j]):
print_diff(c, x, y, i-1, j)
print("- " + x[i])
else:
print("")
This is not the actual function for my diff printing because it doesn't handle a few corner cases – it's just to illustrate Python awesomeness.
The essence of diffing is building the matrix C
which contains lengths for all
subsequences. Building it may seem daunting until you start looking at the
simple cases: the LCS of anything with an empty sequence is empty, so the first row and the first column of the matrix are all zeros.
Building iteratively we can define the LCS length function: when the current elements match, c[i][j] = c[i-1][j-1] + 1, and otherwise c[i][j] = max(c[i][j-1], c[i-1][j]).
That’s basically the core of dynamic programming – building the solution iteratively starting from the simple base cases. Note, though, that it’s working only when the problem has so-called “optimal” structure, meaning that it can be built by reusing previous memoized steps.
Here is the Python function that builds that length matrix for all subsequences:
def lcslen(x, y):
"""Build a matrix of LCS length.
This matrix will be used later to backtrack the real LCS.
"""
# This is our matrix comprised of list of lists.
# We allocate extra row and column with zeroes for the base case of empty
# sequence. Extra row and column is appended to the end and exploit
# Python's ability of negative indices: x[-1] is the last elem.
c = [[0 for _ in range(len(y) + 1)] for _ in range(len(x) + 1)]
for i, xi in enumerate(x):
for j, yj in enumerate(y):
if xi == yj:
c[i][j] = 1 + c[i-1][j-1]
else:
c[i][j] = max(c[i][j-1], c[i-1][j])
return c
Having the matrix of LCS lengths we can now build the actual LCS by backtracking it.
def backtrack(c, x, y, i, j):
"""Backtrack the LCS length matrix to get the actual LCS"""
if i == -1 or j == -1:
return ""
elif x[i] == y[j]:
return backtrack(c, x, y, i-1, j-1) + x[i]
else:
if c[i][j-1] > c[i-1][j]:
return backtrack(c, x, y, i, j-1)
else:
return backtrack(c, x, y, i-1, j)
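As a quick sanity check (a hypothetical example, not from the repo), we can feed two short strings to these functions:
# Hypothetical usage example of lcslen and backtrack
x, y = "XMJYAUZ", "MZJAWXU"
c = lcslen(x, y)
print(backtrack(c, x, y, len(x) - 1, len(y) - 1))  # prints "MJAU"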
But for diff we don’t need the actual LCS, we need the opposite. So diff printing is actually slightly changed backtrack function with 2 additional cases for changes in the head of sequence:
def print_diff(c, x, y, i, j):
"""Print the diff using LCS length matrix by backtracking it"""
if i < 0 and j < 0:
return ""
elif i < 0:
print_diff(c, x, y, i, j-1)
print("+ " + y[j])
elif j < 0:
print_diff(c, x, y, i-1, j)
print("- " + x[i])
elif x[i] == y[j]:
print_diff(c, x, y, i-1, j-1)
print(" " + x[i])
elif c[i][j-1] >= c[i-1][j]:
print_diff(c, x, y, i, j-1)
print("+ " + y[j])
elif c[i][j-1] < c[i-1][j]:
print_diff(c, x, y, i-1, j)
print("- " + x[i])
To invoke it we read input files into Python lists of strings and pass them to our diff functions. We also add the usual Python boilerplate:
import sys

def diff(x, y):
c = lcslen(x, y)
return print_diff(c, x, y, len(x)-1, len(y)-1)
def usage():
print("Usage: {} <file1> <file2>".format(sys.argv[0]))
def main():
if len(sys.argv) != 3:
usage()
sys.exit(1)
with open(sys.argv[1], 'r') as f1, open(sys.argv[2], 'r') as f2:
diff(f1.readlines(), f2.readlines())
if __name__ == '__main__':
main()
And there you go:
$ python3 diff.py f1 f2
+ """Simple diff based on LCS solution"""
+
+ import sys
from lcs import lcslen
def print_diff(c, x, y, i, j):
+ """Print the diff using LCS length matrix by backtracking it"""
+
if i >= 0 and j >= 0 and x[i] == y[j]:
print_diff(c, x, y, i-1, j-1)
print(" " + x[i])
elif j >= 0 and (i == 0 or c[i][j-1] >= c[i-1][j]):
print_diff(c, x, y, i, j-1)
- print("+ " + y[j])
+ print("+ " + y[j])
elif i >= 0 and (j == 0 or c[i][j-1] < c[i-1][j]):
print_diff(c, x, y, i-1, j)
print("- " + x[i])
else:
- print("")
-
+ print("") # pass?
You can check out the full source code at https://github.com/alexdzyoba/diff.
That’s it. Until next time!
Go has a Consul client library, alas, I didn't see any real examples of how to integrate it into your services. So here I'm going to show you how to do exactly this.
I’m going to write a service that will serve at some HTTP endpoint and will
serve key-value data – I believe this resembles a lot of existing microservices
that people write these days. Ours is called webkv
and it’s on Github.
Choose the “v1” tag and you’re good to go.
This service will register itself in Consul with a TTL check that will, well, check internal health status and send heartbeat-like signals to Consul. Should Consul not receive a signal from our service within the TTL interval, it will mark the service as failed and remove it from query results.
Side note: Consul also has simple port checks where the Consul agent judges the health of the service based on port availability. While it's much simpler, e.g. you don't have to add anything to your code, it's not as powerful as a TTL check. With TTL checks you can inspect the internal state of your service, which is a huge advantage in comparison with simple availability – you can accept queries but your data may be stale or invalid. Also, with TTL checks the service status can be not just binary good/bad, but also a warning.
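For comparison, a plain TCP port check registered with the same Go library would look roughly like this (a sketch using the consul/api types, not what webkv uses):
// import consul "github.com/hashicorp/consul/api"
// Sketch: register a service with a TCP port check instead of a TTL check.
serviceDef := &consul.AgentServiceRegistration{
	Name: "webkv",
	Port: 8080,
	Check: &consul.AgentServiceCheck{
		TCP:      "localhost:8080", // the Consul agent dials this address...
		Interval: "10s",            // ...every 10 seconds
	},
}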
All right, to the point! The “v1” version of webkv
uses only the standard
library and the bare minimum of dependencies like Redis client and Consul API
lib. Later I’m going to extend it with other niceties like Prometheus
integration, structured logging, and sane configuration management.
Let’s start with a basic web service that will serve key-value data from Redis.
First, parse port
, ttl
, and addrs
commandline flags. The last one is the
list of Redis addresses separated with ;
.
func main() {
port := flag.Int("port", 8080, "Port to listen on")
addrsStr := flag.String("addrs", "", "(Required) Redis addrs (may be delimited by ;)")
ttl := flag.Duration("ttl", time.Second*15, "Service TTL check duration")
flag.Parse()
if len(*addrsStr) == 0 {
fmt.Fprintln(os.Stderr, "addrs argument is required")
flag.PrintDefaults()
os.Exit(1)
}
addrs := strings.Split(*addrsStr, ";")
Now, we create a service that should implement the http.Handler interface, and launch it.
s, err := service.New(addrs, *ttl)
if err != nil {
log.Fatal(err)
}
http.Handle("/", s)
l := fmt.Sprintf(":%d", *port)
log.Print("Listening on ", l)
log.Fatal(http.ListenAndServe(l, nil))
Nothing fancy here. Now let’s look at the service itself.
import (
"time"
"github.com/go-redis/redis"
)
type Service struct {
Name string
TTL time.Duration
RedisClient redis.UniversalClient
}
The Service
is a type that holds a name, TTL and Redis client handler. It’s
instantiated like this:
func New(addrs []string, ttl time.Duration) (*Service, error) {
s := new(Service)
s.Name = "webkv"
s.TTL = ttl
s.RedisClient = redis.NewUniversalClient(&redis.UniversalOptions{
Addrs: addrs,
})
ok, err := s.Check()
if !ok {
return nil, err
}
return s, nil
}
Check
method issues PING
Redis command to check if we’re ok. This will be
used later with Consul registration.
func (s *Service) Check() (bool, error) {
_, err := s.RedisClient.Ping().Result()
if err != nil {
return false, err
}
return true, nil
}
And now the implementation of ServeHTTP
method that will be invoked for
request processing:
func (s *Service) ServeHTTP(w http.ResponseWriter, r *http.Request) {
status := 200
key := strings.Trim(r.URL.Path, "/")
val, err := s.RedisClient.Get(key).Result()
if err != nil {
http.Error(w, "Key not found", http.StatusNotFound)
status = 404
}
fmt.Fprint(w, val)
log.Printf("url=\"%s\" remote=\"%s\" key=\"%s\" status=%d\n",
r.URL, r.RemoteAddr, key, status)
}
Basically, what we do is retrieve the URL path from the request and use it as a key for the Redis GET command. After that we return the value, or 404 in case of an error. Last, we log the request with a quick and dirty structured logging message in logfmt format.
Launch it:
$ ./webkv -addrs 'localhost:6379'
2017/12/13 21:44:15 Listening on :8080
Query it:
$ curl 'localhost:8080/blink'
182
And see the log message:
2017/12/13 21:44:29 url="/blink" remote="[::1]:35020" key="blink" status=200
Now let’s make our service discoverable via Consul. Consul has simple HTTP API to register services that you can employ directly via “net/http” but we will use its Go library.
Consul Go library doesn’t have examples, BUT, it has tests! Tests are nice not only because it gives you confidence in your lib, approval for the sanity of your code structure and API and, finally, a set of usage examples. Here is an example from Consul API test suite for service registration and TTL checks.
Looking at these tests, we can tell that we interact with Consul by creating a
Client
and then getting a handle for the particular endpoint like
/agent
or /kv
. For each endpoint, there is a corresponding Go type. Agent
endpoint is responsible for service registration and sending health checks. To
store an Agent
handle we extend our Service
type with a new pointer:
import (
consul "github.com/hashicorp/consul/api"
)
type Service struct {
Name string
TTL time.Duration
RedisClient redis.UniversalClient
ConsulAgent *consul.Agent
}
Next in the Service “constructor” we add the creation of Consul agent handle:
func New(addrs []string, ttl time.Duration) (*Service, error) {
...
c, err := consul.NewClient(consul.DefaultConfig())
if err != nil {
return nil, err
}
s.ConsulAgent = c.Agent()
Next, we use the agent to register our service:
serviceDef := &consul.AgentServiceRegistration{
Name: s.Name,
Check: &consul.AgentServiceCheck{
TTL: s.TTL.String(),
},
}
if err := s.ConsulAgent.ServiceRegister(serviceDef); err != nil {
return nil, err
}
The key thing here is the Check
part where we tell Consul how it should check
our service. In our case, we say that we ourselves will send heartbeat-like
signals to Consul so that it will mark our service as failed after the TTL expires. A failed
service is not returned as part of DNS or HTTP API queries.
After service is registered we have to send a TTL check signal with Pass, Fail or Warn type. We have to send it periodically and in time to avoid service failure by TTL. We’ll do it in a separate goroutine:
go s.UpdateTTL(s.Check)
UpdateTTL
method uses time.Ticker
to periodically invoke the
actual update function:
func (s *Service) UpdateTTL(check func() (bool, error)) {
ticker := time.NewTicker(s.TTL / 2)
for range ticker.C {
s.update(check)
}
}
check
argument is a function that returns a service status. Based on its
result we send either pass or fail check:
func (s *Service) update(check func() (bool, error)) {
ok, err := check()
if !ok {
log.Printf("err=\"Check failed\" msg=\"%s\"", err.Error())
if agentErr := s.ConsulAgent.FailTTL("service:"+s.Name, err.Error()); agentErr != nil {
log.Print(agentErr)
}
} else {
if agentErr := s.ConsulAgent.PassTTL("service:"+s.Name, ""); agentErr != nil {
log.Print(agentErr)
}
}
}
The check function that we pass to the goroutine is the one we used earlier when creating the service – it just returns the bool status of the Redis PING command.
And that’s it! This is how it all works together:
webkv
To see it in action you need to launch Consul and Redis. You can launch Consul
with consul agent -dev
or start a normal cluster. How to launch Redis depends
on your distro, in my Fedora it’s just systemctl start redis
.
Now launch the webkv
like this:
$ ./webkv -addrs localhost:6379 -port 8888
2017/12/14 19:00:29 Listening on :8888
Query the Consul for services:
$ dig +noall +answer @127.0.0.1 -p 8600 webkv.service.dc1.consul
webkv.service.dc1.consul. 0 IN A 127.0.0.1
$ curl localhost:8500/v1/health/service/webkv?passing
[
{
"Node": {
"ID": "a4618035-c73d-9e9e-2b83-24ece7c24f45",
"Node": "alien",
"Address": "127.0.0.1",
"Datacenter": "dc1",
"TaggedAddresses": {
"lan": "127.0.0.1",
"wan": "127.0.0.1"
},
"Meta": {
"consul-network-segment": ""
},
"CreateIndex": 5,
"ModifyIndex": 6
},
"Service": {
"ID": "webkv",
"Service": "webkv",
"Tags": [],
"Address": "",
"Port": 0,
"EnableTagOverride": false,
"CreateIndex": 15,
"ModifyIndex": 37
},
"Checks": [
{
"Node": "alien",
"CheckID": "serfHealth",
"Name": "Serf Health Status",
"Status": "passing",
"Notes": "",
"Output": "Agent alive and reachable",
"ServiceID": "",
"ServiceName": "",
"ServiceTags": [],
"Definition": {},
"CreateIndex": 5,
"ModifyIndex": 5
},
{
"Node": "alien",
"CheckID": "service:webkv",
"Name": "Service 'webkv' check",
"Status": "passing",
"Notes": "",
"Output": "",
"ServiceID": "webkv",
"ServiceName": "webkv",
"ServiceTags": [],
"Definition": {},
"CreateIndex": 15,
"ModifyIndex": 141
}
]
}
]
Now if we stop the Redis we’ll see the log messages
...
2017/12/14 19:29:19 err="Check failed" msg="EOF"
2017/12/14 19:29:27 err="Check failed" msg="dial tcp [::1]:6379: getsockopt: connection refused"
...
and that Consul doesn’t return our service:
$ dig +noall +answer @127.0.0.1 -p 8600 webkv.service.dc1.consul
$ # empty reply
$ curl localhost:8500/v1/health/service/webkv?passing
[]
Starting Redis again will make service healthy.
So, basically this is it – the basic Web service with Consul integration for service discovery and health checking. Check out the full source code at github.com/alexdzyoba/webkv. Next time we’ll add metrics export for monitoring our service with Prometheus.
We all know that Docker images are built with Dockerfiles but in my not so humble opinion, Dockerfiles are silly - they are fragile, make bloated images and look like crap. For me, building Docker images was tedious and grumpy work until I found Ansible. The moment you get your first Ansible playbook working you'll never look back. I immediately felt grateful for Ansible's simple automation tools and I started to use Ansible to provision Docker containers. During that time I found the Ansible Container project and tried to use it, but in 2016 it was not ready for me. Soon after, I found Hashicorp's Packer, which has Ansible provisioning support, and from that moment I use this powerful combo to build all of my Docker images.
Hereafter, I want to show you an example of how it all works together, but first let’s return to my point about Dockerfiles.
In short, it's because each line in a Dockerfile creates a new layer. While it's awesome to see the layered fs and be able to reuse the layers for other images, in reality, it's madness. Your image size grows without control and now you have a 2GB image for a Python app, and 90% of your layers are not reused. So, actually, you don't need all these layers.
To squash layers, you either do some additional steps like invoking
docker-squash
or you have to issue as few commands as possible. And that's why in real
production Dockerfiles we see way too many &&s, because chaining RUN
commands with && will create a single layer.
To illustrate my point, look at the Dockerfiles for two of the most popular Docker images – Redis and nginx. The main part of these Dockerfiles is a giant chain of commands with newline escaping, in-place config patching with sed, and cleanup as the last command.
RUN set -ex; \
\
buildDeps=' \
wget \
\
gcc \
libc6-dev \
make \
'; \
apt-get update; \
apt-get install -y $buildDeps --no-install-recommends; \
rm -rf /var/lib/apt/lists/*; \
\
wget -O redis.tar.gz "$REDIS_DOWNLOAD_URL"; \
echo "$REDIS_DOWNLOAD_SHA *redis.tar.gz" | sha256sum -c -; \
mkdir -p /usr/src/redis; \
tar -xzf redis.tar.gz -C /usr/src/redis --strip-components=1; \
rm redis.tar.gz; \
\
# disable Redis protected mode [1] as it is unnecessary in context of Docker
# (ports are not automatically exposed when running inside Docker, but rather explicitly by specifying -p / -P)
# [1]: https://github.com/antirez/redis/commit/edd4d555df57dc84265fdfb4ef59a4678832f6da
grep -q '^#define CONFIG_DEFAULT_PROTECTED_MODE 1$' /usr/src/redis/src/server.h; \
sed -ri 's!^(#define CONFIG_DEFAULT_PROTECTED_MODE) 1$!\1 0!' /usr/src/redis/src/server.h; \
grep -q '^#define CONFIG_DEFAULT_PROTECTED_MODE 0$' /usr/src/redis/src/server.h; \
# for future reference, we modify this directly in the source instead of just supplying a default configuration flag because apparently "if you specify any argument to redis-server, [it assumes] you are going to specify everything"
# see also https://github.com/docker-library/redis/issues/4#issuecomment-50780840
# (more exactly, this makes sure the default behavior of "save on SIGTERM" stays functional by default)
\
make -C /usr/src/redis -j "$(nproc)"; \
make -C /usr/src/redis install; \
\
rm -r /usr/src/redis; \
\
apt-get purge -y --auto-remove $buildDeps
All of this madness is for the sake of avoiding layer creation. And that's where I want to ask a question – is this the best way to do things in 2017? Really? For me, all these Dockerfiles look like a poor man's bash script. And gosh, I hate bash. But on the other hand, I like containers, so I need a neat way to fight this insanity.
Instead of putting raw bash commands into a Dockerfile, we can write a reusable Ansible role and invoke it from a playbook that will be used inside the Docker container to provision it.
This is how I do it
FROM debian:9
# Bootstrap Ansible via pip
RUN apt-get update && apt-get install -y wget gcc make python python-dev python-setuptools python-pip libffi-dev libssl-dev libyaml-dev
RUN pip install -U pip
RUN pip install -U ansible
# Prepare Ansible environment
RUN mkdir /ansible
COPY . /ansible
ENV ANSIBLE_ROLES_PATH /ansible/roles
ENV ANSIBLE_VAULT_PASSWORD_FILE /ansible/.vaultpass
# Launch Ansible playbook from inside container
RUN cd /ansible && ansible-playbook -c local -v mycontainer.yml
# Cleanup
RUN rm -rf /ansible
RUN for dep in $(pip show ansible | grep Requires | sed 's/Requires: //g; s/,//g'); do pip uninstall -y $dep; done
RUN apt-get purge -y python-dev python-pip
RUN apt-get autoremove -y && apt-get autoclean -y && apt-get clean -y
RUN rm -rf /var/lib/apt/lists/* /tmp/* /var/tmp* /usr/share/doc/*
# Environment setup
ENV HOME /home/test
WORKDIR /
USER test
CMD ["/bin/bash"]
Drop this Dockerfile to the root of your Ansible repo and it will build Docker image using your playbooks, roles, inventory and vault secrets.
It works, it’s reusable, e.g. I have some base roles that applied for docker container and on bare metal machines, provisioning is easier to maintain in Ansible. But still, it feels awkward.
So I went a step further and started to use Packer. Packer is a tool specifically built for creating machine images. It can be used not only to build container images but also VM images for cloud providers like AWS and GCP.
It immediately hooked me with these lines in the documentation:
Packer builds Docker containers without the use of Dockerfiles. By not using Dockerfiles, Packer is able to provision containers with portable scripts or configuration management systems that are not tied to Docker in any way. It also has a simple mental model: you provision containers much the same way you provision a normal virtualized or dedicated server.
That’s what I wanted to achieve previously with my Ansiblized Dockerfiles.
So let’s see how we can build Redis image that is almost identical to the official.
First, let’s create a playground dir
$ mkdir redis-packer && cd redis-packer
Packer is controlled with a declarative configuration in JSON format. Here is ours:
{
"builders": [{
"type": "docker",
"image": "debian:jessie-slim",
"commit": true,
"changes": [
"VOLUME /data",
"WORKDIR /data",
"EXPOSE 6379",
"ENTRYPOINT [\"docker-entrypoint.sh\"]",
"CMD [\"redis-server\"]"
]
}],
"provisioners": [{
"type": "ansible",
"user": "root",
"playbook_file": "provision.yml"
}],
"post-processors": [[ {
"type": "docker-tag",
"repository": "docker.io/alexdzyoba/redis-packer",
"tag": "latest"
} ]]
}
Put this in redis.json
file and let’s figure out what all of this means.
First, we describe our builders – what kind of image we’re going to build. In
our case, it’s a Docker image based on debian:jessie-slim
. commit: true
tells
that after all the setup we want to have changes committed. The other option is
export to tar archive with the export_path
option.
Next, we describe our provisioner and that’s where Ansible will step in the game. Packer has support for Ansible in 2 modes – local and remote.
Local mode ("type": "ansible-local"
) means that Ansible will be launched
inside the Docker container – just like my previous setup. But Ansible won’t be
installed by Packer so you have to do this by yourself with shell
provisioner
– similar to my Ansible bootstrapping in Dockerfile.
Remote mode means that Ansible will be run on your build host and connect to the container via SSH, so you don’t need a full-blown Ansible installed in Docker container – just a Python interpreter.
So, I’m using remote Ansible that will connect as root user and launch
provision.yml
playbook.
After provisioning is done, Packer does post-processing. I’m doing just the tagging of the image but you can also push to the Docker registry.
Now let’s see the provision.yml playbook:
---
- name: Provision Python
hosts: all
gather_facts: no
tasks:
- name: Boostrap python
raw: test -e /usr/bin/python || (apt-get -y update && apt-get install -y python-minimal)
- name: Provision Redis
hosts: all
tasks:
- name: Ensure Redis configured with role
import_role:
name: alexdzyoba.redis
- name: Create workdir
file:
path: /data
state: directory
owner: root
group: root
mode: 0755
- name: Put runtime programs
copy:
src: files/{{ item }}
dest: /usr/local/bin/{{ item }}
mode: 0755
owner: root
group: root
with_items:
- gosu
- docker-entrypoint.sh
- name: Container cleanup
hosts: all
gather_facts: no
tasks:
- name: Remove python
raw: apt-get purge -y python-minimal && apt-get autoremove -y
- name: Remove apt lists
raw: rm -rf /var/lib/apt/lists/*
The playbook consists of 3 plays: Python bootstrapping, Redis provisioning, and container cleanup.
To provision a container (or any other host) with Ansible, we need to install
Python. But how do we install Python via Ansible, for Ansible?
There is a special Ansible raw
module for exactly this
case – it doesn't require a Python interpreter because it runs bare shell
commands over SSH. We need to invoke the play with gather_facts: no to skip
fact gathering, which is done in Python.
Redis provisioning is done with my Ansible role
that does exactly the same steps as the official Redis Dockerfile – it creates the
redis user and group, downloads the source tarball, disables protected mode,
compiles it and does the after-build cleanup. Check out the details
on Github.
Finally, we do the container cleanup by removing Python and cleaning up package management stuff.
There are only 2 things left – gosu and docker-entrypoint.sh files. These files along with Packer config and Ansible role are available at my redis-packer Github repo
Finally, all we do is launch it like this
$GOPATH/bin/packer build redis.json
You can see example output in this gist
In the end, we got an image that is even a bit smaller than official:
$ docker images
REPOSITORY TAG IMAGE ID CREATED SIZE
docker.io/alexdzyoba/redis-packer latest 05c7aebe901b 3 minutes ago 98.9 MB
docker.io/redis 3.2 d3f696a9f230 4 weeks ago 99.7 MB
Of course, my solution has its own drawbacks. First, you have to learn new tools – Packer and Ansible. But I strongly advise for learning Ansible, because you’ll need it for other kinds of automation in your projects. And you DO automate your tasks, right?
The second drawback is that now container building is more involved with all the packer config, ansible roles and playbooks and stuff. Counting by the lines of code there are 174 lines now
$ (find alexdzyoba.redis -type f -name '*.yml' -exec cat {} \; && cat redis.json provision.yml) | wc -l
174
While originally it was only 77:
$ wc -l Dockerfile
77 Dockerfile
And again I would advise you to go this path, because provisioning becomes reusable and maintainable: a single packer build redis.json command produces a ready and tagged image, and updating the image is just a matter of changing the redis_version and redis_download_sha variables – no new Dockerfile needed.
So that's my Docker image building setup for now. It works well for me and I kinda enjoy the process now. I would also like to look at Ansible Container again but that will be another post, so stay tuned – this blog has an Atom feed and I also post on twitter @AlexDzyoba
This sounds like a very important part of your infrastructure, so you are better off making it highly available, and RabbitMQ has clustering support for this case.
Now there are 2 ways to make a RabbitMQ cluster. One is by hand with
rabbitmqctl join_cluster
as described in the
documentation. And the
other one is via config file.
I haven’t seen the latter case described anywhere so I’ll do it myself in this post.
Most of the things I’ll describe here is automated in my rabbitmq-cluster Ansible role.
Suppose you have somehow installed RabbitMQ server on 3 nodes. It has started and now you have 3 independent RabbitMQ instances.
To make it a cluster you first stop all 3 instances. You have to do this because, once set up, RabbitMQ configuration (including the cluster) is persisted in mnesia files, and RabbitMQ will try to build a cluster using its own internal facilities.
With the instances stopped, you have to clear the mnesia base dir like this: rm -rf $MNESIA_BASE/*. Again, you need this to clear any previous configuration
(usually broken leftovers from previous failed attempts).
Now is the meat of it. On each node open the /etc/rabbitmq/rabbitmq.config and add the list of cluster nodes:
{cluster_nodes, {['rabbit@rabbit1', 'rabbit@rabbit2', 'rabbit@rabbit3'], disc}},
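For context, in the classic Erlang-terms config format this line sits inside the rabbit section, so the whole /etc/rabbitmq/rabbitmq.config would look roughly like this (a minimal sketch):
[
  {rabbit, [
    {cluster_nodes, {['rabbit@rabbit1', 'rabbit@rabbit2', 'rabbit@rabbit3'], disc}}
  ]}
].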
Next, again on each node, create the file /var/lib/rabbitmq/.erlang.cookie and add some string to it. It can really be anything as long as it's identical on all nodes in the cluster. This file must have 0600 permissions and be owned by the user and group of the rabbitmq server process.
Now we are ready to start the cluster. But hold on. To make it work you MUST start the nodes one by one, not simultaneously, because otherwise the cluster won't be created. This is a workaround for some strange behavior that I found in a mailing list thread.
I hit this one 2 times - once when I configured my RabbitMQ nodes via tmux in synchronized panes, and again when I was writing the Ansible role.
But in the end, I’ve got a very nice cluster with sane production config values that you can check out in defaults of my role
That’s it. Untill next time!
Now, having a single binary, how do you distribute it to servers? How do you know which version is deployed, how do you upgrade it, and how do you roll it back?
What's common to all of these problems is versioning. You need to assign and track the version of your Go program to keep your sanity in prod.
One of the solutions is docker — you put the binary into the scratch
image,
put anything you want along with the binary, tag the image, upload it to the
registry and then use it on the server with docker tools.
It sounds reasonable and trendy. But operating docker is not a walk in the park. Networking with docker is hard, docker breaks on upgrades, etc. Though in the long run, it could pay off because it'll allow you to transition to some nice platform like Kubernetes.
But what if you don’t want to use docker? What if you don’t want to install the docker tools and keep the docker daemon running on your production just for the single binary?
If you don’t use docker then in case of golang you’re entering a hostile place.
Go tooling gives you a solution in the form of go get
. But go get
only
fetches from HEAD and requires you to manually use git to switch versions and
then invoke go build to rebuild the program. Also, keeping a dev environment on
the production infrastructure is stupid.
Instead, I have a much simpler and battle-tested solution — packages. Yes, the simple and familiar distro packages like “deb” and “rpm”. It has versions, it has good tooling allowing you to query, upgrade and downgrade packages, supply any extra data and even script the installations with things like postinst.
So the idea is to package the go binary as a package and install it on your
infrastructure with package management utilities. Though building packages
sometimes gets scary, packaging a single file (with metadata) is really simple
with the help of an amazing tool called fpm
.
fpm allows you to create a target package like "deb" or "rpm" from various
sources like a plain directory, tarballs or other packages – check the full list of
supported sources and targets on its github page.
To package Go binaries we’ll use “directory” source and package it as “deb” and “rpm”.
Let’s start with “rpm”:
$ fpm -s dir -t rpm -n mypackage $GOPATH/bin/packer
Created package {:path=>"mypackage-1.0-1.x86_64.rpm"}
And that’s a valid package!
$ rpm -qipl mypackage-1.0-1.x86_64.rpm
Name : mypackage
Version : 1.0
Release : 1
Architecture: x86_64
Install Date: (not installed)
Group : default
Size : 87687286
License : unknown
Signature : (none)
Source RPM : mypackage-1.0-1.src.rpm
Build Date : Mon 06 Nov 2017 07:54:47 PM MSK
Build Host : airblade
Relocations : /
Packager : <avd@airblade>
Vendor : avd@airblade
URL : http://example.com/no-uri-given
Summary : no description given
Description :
no description given
/home/avd/go/bin/packer
You can see, though, that it put the file with the path as is, in my case under my $GOPATH. We can tell fpm where to put it on the target system like this:
$ fpm -f -s dir -t rpm -n mypackage $GOPATH/bin/packer=/usr/local/bin/
Force flag given. Overwriting package at mypackage-1.0-1.x86_64.rpm {:level=>:warn}
Created package {:path=>"mypackage-1.0-1.x86_64.rpm"}
$ rpm -qpl mypackage-1.0-1.x86_64.rpm
/usr/local/bin/packer
Now, that’s good.
By the way, because we made it an rpm package, we got an 80% reduction in size due to package compression:
$ stat -c '%s' $GOPATH/bin/packer mypackage-1.0-1.x86_64.rpm
87687286
16097515
If you’re using deb-based distro all you have to do is change the target to the
deb
:
$ fpm -f -s dir -t deb -n mypackage $GOPATH/bin/packer=/usr/local/bin/
Debian packaging tools generally labels all files in /etc as config files, as mandated by policy, so fpm defaults to this behavior for deb packages. You can disable this default behavior with --deb-no-default-config-files flag {:level=>:warn}
Created package {:path=>"mypackage_1.0_amd64.deb"}
$ dpkg-deb -I mypackage_1.0_amd64.deb
new debian package, version 2.0.
size 16317930 bytes: control archive=430 bytes.
248 bytes, 11 lines control
126 bytes, 2 lines md5sums
Package: mypackage
Version: 1.0
License: unknown
Vendor: avd@airblade
Architecture: amd64
Maintainer: <avd@airblade>
Installed-Size: 85632
Section: default
Priority: extra
Homepage: http://example.com/no-uri-given
Description: no description given
$ dpkg-deb -c mypackage_1.0_amd64.deb
drwxrwxr-x 0/0 0 2017-11-06 20:05 ./
drwxr-xr-x 0/0 0 2017-11-06 20:05 ./usr/
drwxr-xr-x 0/0 0 2017-11-06 20:05 ./usr/share/
drwxr-xr-x 0/0 0 2017-11-06 20:05 ./usr/share/doc/
drwxr-xr-x 0/0 0 2017-11-06 20:05 ./usr/share/doc/mypackage/
-rw-r--r-- 0/0 135 2017-11-06 20:05 ./usr/share/doc/mypackage/changelog.gz
drwxr-xr-x 0/0 0 2017-11-06 20:05 ./usr/local/
drwxr-xr-x 0/0 0 2017-11-06 20:05 ./usr/local/bin/
-rwxrwxr-x 0/0 87687286 2017-09-06 20:06 ./usr/local/bin/packer
Note, that I’m creating deb package on Fedora which is rpm-based distro!
Now you just upload the package to your repo and you're good to go.
Instagram is a Python/Django app that is running on uWSGI.
To run a Python app, the uWSGI master process forks and launches the app in child processes. This should have leveraged the Copy-on-Write (CoW) mechanism in Linux - memory is shared among the processes as long as it's not modified. And shared memory is good because it doesn't waste RAM (because it's shared) and it improves the cache hit ratio because multiple processes read the same memory. Apps that are launched by uWSGI are mostly identical because it's the same code, so there should be a lot of memory shared between the uWSGI master and child processes. But, instead, shared memory was dropping at the start of the process.
At first, they thought that it was because of reference counting - every read of an object, including immutable ones like code objects, causes a write to memory for that object's reference counter. But disabling reference counting didn't prove that, so they went for profiling!
With the help of perf, they found out that it was
the garbage collector that caused most of the page faults - the collect
function.
So they decided to disable the garbage collector, because reference
counting will still be used to free the memory. CPython provides a gc
module that allows you to control
garbage collection. Instagram guys found that it’s better to use
gc.set_threshold(0)
instead of gc.disable()
because some library (like
msgpack in their case) can reenable it back, but gc.set_threshold(0)
is
setting the collection frequency to zero effectively disabling it and also it’s
immune to any subsequent gc.enable()
calls.
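In code, the approach described above boils down to something like this (a minimal sketch):
import gc

# gc.disable() can be silently undone by a library calling gc.enable().
# Setting the collection thresholds to zero disables automatic collection
# and is not affected by a later gc.enable() call.
gc.set_threshold(0)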
This worked, but the garbage collection was triggered at the exit of the child process and thrashed the CPU for a whole minute, which is useless because the process was about to be replaced by a new one. This can be dismissed in 2 ways: either register atexit.register(os._exit, 0), which tells Python to hard exit the process at the end without any further cleanup, or use the --skip-atexit-teardown option in recent uWSGI.
With all these hacks in place, things now work as intended.
What I’ve discovered from this story is that CPython has an interesting scheme for automatic memory management – it uses reference counting to release the memory that is no longer used and tracing generational garbage collector to fight cyclic objects.
So this is how reference counting works. Each object in Python has reference
counter (ob_refcnt
in the PyObject
struct) - a special variable that is
incremented when the object is referenced (e.g. added to the list or passed to
the function) and decremented when it’s released. When the ref counter value is
decremented to zero it’s released by the runtime.
Reference counting is a very nice and simple method for automatic memory management. It’s deterministic and avoids any background processing which makes it more efficient on the low power systems such as mobile devices.
But, unfortunately, it has some really bad flaws. First, it adds overhead for storing reference counter in every single object. Second, for multithreaded apps ref counting has to be atomic and thus must be synchronized between CPU cores which is slow. And finally, the references can form cycles which prevent counters from decrementing and such cyclic objects remains allocated forever.
Anyway, CPython uses reference counting as the main method for memory management.
As for the drawbacks, they are not that scary in most cases. Memory overhead for
storing ref counters is not really noticeable - even for a million objects, it
would be only 8 MiB (a ref counter is ssize_t
which is 8 bytes). Synchronization
for ref counting is not needed because CPython has the Global Interpreter Lock
(GIL).
The only problem left is fighting cycles. That's why CPython periodically invokes a tracing garbage collector. CPython's GC is generational, i.e. it has 3 generations - 0, 1 and 2, where 0 is the youngest generation where all objects are born and 2 is the oldest generation where objects live until the process exits. Objects that survive GC get moved to the next generation.
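Here is a tiny illustration of why the cyclic GC is needed (a sketch):
import gc

class Node:
    pass

a, b = Node(), Node()
a.other, b.other = b, a  # a reference cycle
del a, b                 # the refcounts never drop to zero because of the cycle
print(gc.collect())      # the tracing GC frees them and returns how many objects it collected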
The idea of dividing the objects into generations is based on the heuristic that most allocated objects are short lived, so the GC should try to free these objects more frequently than the longer lived objects, which usually live forever.
All of this might seem complicated, but I think it's a good tradeoff for CPython to employ such a scheme. Some might say - why not leave only GC like most languages do? Well, GC has its own drawbacks. First, it must run in the background, which in CPython is not really possible because of the GIL, so GC is a stop-the-world process. And second, because GC happens in the background, the exact time frame for object release is undetermined.
So I think for CPython it’s a good balance to use ref counting and GC to complement each other.
In the end, CPython is not the only language/runtime that uses reference counting. Objective-C and Swift have compile-time automatic reference counting (ARC). Remember that ref counting is more deterministic, so it is a huge win for iOS devices.
Rust also has reference counting (the Rc and Arc types).
C++ has smart pointers, which basically are objects with reference counters that are destroyed by the C++ runtime.
Many other languages like Perl and PHP also use reference counting for memory management.
But, yeah, most of the languages now are based on a pure GC.
CPython has an interesting scheme for managing memory - object lifetimes are managed by reference counting and, to fight cycles, it employs a tracing garbage collector.
It’s really seldom when you want to debug on assembly level, usually, you want to see the sources. But often times you debug the program on the host other than the build host and see this really frustrating message:
$ gdb -q python3.7
Reading symbols from python3.7...done.
(gdb) l
6 ./Programs/python.c: No such file or directory.
Ouch. Everybody has been here. I've seen this so often, and source listing is so vital for sensible debugging, that I think it's very important to get into the details and understand how GDB shows source code in a debugging session.
It all starts with debug info - special sections in the binary file produced by the compiler and used by the debugger and other handy tools.
In GCC there is the well-known -g
flag for that. Most projects with some kind of
build system either build with debug info by default or have some flag for it.
In the case of CPython, -g
is added by default but nevertheless, we’re better
off adding --with-pydebug
to enable all kinds of debug options available in
CPython:
$ ./configure --with-pydebug
$ make -j
While you’re watching the compilation log, notice the -g
option in gcc
invocations.
This -g
option will generate debug sections - binary sections to insert into
program’s binary. These sections are usually in DWARF format. For ELF binaries
these debug sections have names like .debug_*
, e.g. .debug_info
or
.debug_loc
. These debug sections are what makes the magic of debugging
possible - basically, it’s a mapping of assembly level instructions to the
source code.
To find whether your program has debug symbols you can list the sections of the
binary with objdump
:
$ objdump -h ./python
python: file format elf64-x86-64
Sections:
Idx Name Size VMA LMA File off Algn
0 .interp 0000001c 0000000000400238 0000000000400238 00000238 2**0
CONTENTS, ALLOC, LOAD, READONLY, DATA
1 .note.ABI-tag 00000020 0000000000400254 0000000000400254 00000254 2**2
CONTENTS, ALLOC, LOAD, READONLY, DATA
...
25 .bss 00031f70 00000000008d9e00 00000000008d9e00 002d9dfe 2**5
ALLOC
26 .comment 00000058 0000000000000000 0000000000000000 002d9dfe 2**0
CONTENTS, READONLY
27 .debug_aranges 000017f0 0000000000000000 0000000000000000 002d9e56 2**0
CONTENTS, READONLY, DEBUGGING
28 .debug_info 00377bac 0000000000000000 0000000000000000 002db646 2**0
CONTENTS, READONLY, DEBUGGING
29 .debug_abbrev 0001fcd7 0000000000000000 0000000000000000 006531f2 2**0
CONTENTS, READONLY, DEBUGGING
30 .debug_line 0008b441 0000000000000000 0000000000000000 00672ec9 2**0
CONTENTS, READONLY, DEBUGGING
31 .debug_str 00031f18 0000000000000000 0000000000000000 006fe30a 2**0
CONTENTS, READONLY, DEBUGGING
32 .debug_loc 0034190c 0000000000000000 0000000000000000 00730222 2**0
CONTENTS, READONLY, DEBUGGING
33 .debug_ranges 00062e10 0000000000000000 0000000000000000 00a71b2e 2**0
CONTENTS, READONLY, DEBUGGING
or readelf
:
$ readelf -S ./python
There are 38 section headers, starting at offset 0xb41840:
Section Headers:
[Nr] Name Type Address Offset
Size EntSize Flags Link Info Align
[ 0] NULL 0000000000000000 00000000
0000000000000000 0000000000000000 0 0 0
[ 1] .interp PROGBITS 0000000000400238 00000238
000000000000001c 0000000000000000 A 0 0 1
...
[26] .bss NOBITS 00000000008d9e00 002d9dfe
0000000000031f70 0000000000000000 WA 0 0 32
[27] .comment PROGBITS 0000000000000000 002d9dfe
0000000000000058 0000000000000001 MS 0 0 1
[28] .debug_aranges PROGBITS 0000000000000000 002d9e56
00000000000017f0 0000000000000000 0 0 1
[29] .debug_info PROGBITS 0000000000000000 002db646
0000000000377bac 0000000000000000 0 0 1
[30] .debug_abbrev PROGBITS 0000000000000000 006531f2
000000000001fcd7 0000000000000000 0 0 1
[31] .debug_line PROGBITS 0000000000000000 00672ec9
000000000008b441 0000000000000000 0 0 1
[32] .debug_str PROGBITS 0000000000000000 006fe30a
0000000000031f18 0000000000000001 MS 0 0 1
[33] .debug_loc PROGBITS 0000000000000000 00730222
000000000034190c 0000000000000000 0 0 1
[34] .debug_ranges PROGBITS 0000000000000000 00a71b2e
0000000000062e10 0000000000000000 0 0 1
[35] .shstrtab STRTAB 0000000000000000 00b416d5
0000000000000165 0000000000000000 0 0 1
[36] .symtab SYMTAB 0000000000000000 00ad4940
000000000003f978 0000000000000018 37 8762 8
[37] .strtab STRTAB 0000000000000000 00b142b8
000000000002d41d 0000000000000000 0 0 1
Key to Flags:
W (write), A (alloc), X (execute), M (merge), S (strings), l (large)
I (info), L (link order), G (group), T (TLS), E (exclude), x (unknown)
O (extra OS processing required) o (OS specific), p (processor specific)
As we can see, our freshly compiled Python has .debug_* sections, hence it has
debug info.
Debug info is a collection of DIEs - Debug Info Entries. Each DIE has a tag specifying what kind of DIE it is, and attributes that describe this DIE - things like variable name and line number.
To find the sources GDB parses .debug_info
section to find all DIEs with tag
DW_TAG_compile_unit
. The DIE with this tag has 2 main attributes
DW_AT_comp_dir
(compilation directory) and DW_AT_name
- path to the source
file. Combined they provide the full path to the source file for the particular
compilation unit (object file).
To parse debug info you can again use objdump
:
$ objdump -g ./python | vim -
and there you can see the parsed debug info:
Contents of the .debug_info section:
Compilation Unit @ offset 0x0:
Length: 0x222d (32-bit)
Version: 4
Abbrev Offset: 0x0
Pointer Size: 8
<0><b>: Abbrev Number: 1 (DW_TAG_compile_unit)
<c> DW_AT_producer : (indirect string, offset: 0xb6b): GNU C99 6.3.1 20161221 (Red Hat 6.3.1-1) -mtune=generic -march=x86-64 -g -Og -std=c99
<10> DW_AT_language : 12 (ANSI C99)
<11> DW_AT_name : (indirect string, offset: 0x10ec): ./Programs/python.c
<15> DW_AT_comp_dir : (indirect string, offset: 0x7a): /home/avd/dev/cpython
<19> DW_AT_low_pc : 0x41d2f6
<21> DW_AT_high_pc : 0x1b3
<29> DW_AT_stmt_list : 0x0
It reads like this - for address range from DW_AT_low_pc
= 0x41d2f6
to
DW_AT_low_pc + DW_AT_high_pc
= 0x41d2f6
+ 0x1b3
= 0x41d4a9
source code
file is the ./Programs/python.c
located in /home/avd/dev/cpython
. Pretty
straightforward.
So this is what happens when GDB tries to show you the source code: it parses .debug_info to find the DW_AT_comp_dir and DW_AT_name attributes for the current object file (range of addresses), and then opens the source file at DW_AT_comp_dir/DW_AT_name.
So to fix our problem with ./Programs/python.c: No such file or directory.
we
have to obtain our sources on the target host (copy or git clone
) and do one
of the following:
You can reconstruct the sources path on the target host, so GDB will find the source file where it expects. Stupid but it will work.
In my case, I can just do
git clone https://github.com/python/cpython.git /home/avd/dev/cpython
and check out the needed commit-ish.
You can direct GDB to the new source path right in the debug session with
directory <dir>
command:
(gdb) list
6 ./Programs/python.c: No such file or directory.
(gdb) directory /usr/src/python
Source directories searched: /usr/src/python:$cdir:$cwd
(gdb) list
6 #ifdef __FreeBSD__
7 #include <fenv.h>
8 #endif
9
10 #ifdef MS_WINDOWS
11 int
12 wmain(int argc, wchar_t **argv)
13 {
14 return Py_Main(argc, argv);
15 }
Sometimes adding another source path is not enough if you have a complex
hierarchy. In this case you can add a substitution rule for the source path with the set substitute-path
GDB command.
(gdb) list
6 ./Programs/python.c: No such file or directory.
(gdb) set substitute-path /home/avd/dev/cpython /usr/src/python
(gdb) list
6 #ifdef __FreeBSD__
7 #include <fenv.h>
8 #endif
9
10 #ifdef MS_WINDOWS
11 int
12 wmain(int argc, wchar_t **argv)
13 {
14 return Py_Main(argc, argv);
15 }
You can trick GDB's source lookup by moving the binary to the directory with the sources.
mv python /home/user/sources/cpython
This will work because GDB will try to look for sources in the current
directory ($cwd
) as the last resort.
-fdebug-prefix-map
You can substitute the source path on the build stage with
-fdebug-prefix-map=old_path=new_path
option. Here is how to do it within
CPython project:
$ make distclean # start clean
$ ./configure CFLAGS="-fdebug-prefix-map=$(pwd)=/usr/src/python" --with-pydebug
$ make -j
And now we have new sources dir:
$ objdump -g ./python
...
<0><b>: Abbrev Number: 1 (DW_TAG_compile_unit)
<c> DW_AT_producer : (indirect string, offset: 0xb65): GNU C99 6.3.1 20161221 (Red Hat 6.3.1-1) -mtune=generic -march=x86-64 -g -Og -std=c99
<10> DW_AT_language : 12 (ANSI C99)
<11> DW_AT_name : (indirect string, offset: 0x10ff): ./Programs/python.c
<15> DW_AT_comp_dir : (indirect string, offset: 0x558): /usr/src/python
<19> DW_AT_low_pc : 0x41d336
<21> DW_AT_high_pc : 0x1b3
<29> DW_AT_stmt_list : 0x0
...
This is the most robust way to do it because you can set it to something like
/usr/src/<project>
, install sources there from a package and debug like a boss.
GDB uses debug info stored in DWARF format to find source level info. DWARF is pretty straightforward format - basically, it’s a tree of DIEs (Debug Info Entries) that describes object files of your programs along with variables and functions.
There are multiple ways to help GDB find sources, where the easiest ones are
directory
and set substitute-path
commands, though -fdebug-prefix-map
is
really useful.
Now that you have source-level info, go and explore something!
I never was a fan of laptops, I mean 2000s era laptops, the ones that were bulky, heavy and hard to upgrade. The last point was especially important to me because in the 2000s you had to upgrade your workstation - add more RAM, more HDD, a newer CPU. You followed Intel's Tick-Tock schedule, chose the Tock ones, and got a performance boost (according to benchmarks).
But recently, all of a sudden I’ve realized that I have a 4-year-old machine with Intel i3 CPU and it’s fine. I don’t feel the need to upgrade. Partly it’s because I’m not using a Windows for a long time. On my Fedora, I mostly sit in the terminal without desktop environment like Gnome or KDE, edit text in Vim and that’s all I need. The heaviest thing on my machine - the browser - is working fine too, I can play a 1080p youtube video, I can load bloated sites.
The other part that saves me from the upgrade is that hardware itself is not improving vertically, but rather horizontally. Simply switching to a newer CPU will not make your computer life full of magic and unicorns - just compare Haswell and Kaby Lake CPUs. The only thing that noticeably increased and might gain you some performance is the bus speed, which went from 5 GT/s to 8 GT/s. All the other things are about attaching more stuff to your CPU - more memory, more I/O devices. And the funny thing is that a 3-year-old Haswell from 2014 costs the same $310 as the new and shiny Kaby Lake. I'm not saying that the progress in CPUs has stopped - there is a server market, there are gaming and HPC markets that need and feel all these developments. I'm saying that for consumer machines like desktops there is no need to upgrade often.
So there is a rare need to upgrade your machine now and recent laptops are nice, light and hold battery for at least 8 hours. So when I got an option to get a laptop at my job, I took it. The problem was that it was a Macbook Air.
And I’m a Linux guy, so I had to install Fedora on this stuff. I don’t care about you guys whining “…but macOS is so much better and friendly and nice and blah-blah…”. No. It’s not. Well, it’s not for me. I have a simple and efficient setup that serves me extremely well, looks gorgeous for me and don’t interfere with my work. It doesn’t mean that I didn’t try - I did, but working in macOS without tiling WM, strange keyboard shortcuts (you can’t set Alt-Shift to switch keyboard layout) and fake user-friendliness (I dare you to tell me how to show hidden files in Finder) make me dog slow.
So I’ve decided to install Fedora on Macbook Air and because it’s a little bit tricky, I wrote this guide. In the end, we’ll have a laptop with:
Because we’ll leave macOS we have to prepare Macbook. Thanks to the UEFI advancement in the Linux we don’t need rEFIt/rEFInd - modern distros are installed as a breeze. So the only thing we have to do is shrink macOS partition and prepare USB stick.
My Macbook has only 128 GBs of SSD and I’ve decided to leave macOS on it, so I need to partition the drive leaving some usable amount of space for macOS. I don’t have any experience with macOS and thought that 40 GBs will be enough even if I will use it.
To partition the drive I’ve used “Disk Utility”. Just press ‘+’ button and set the desired size for the new partition. Leave ‘Format’ default (“Mac OS Extended (Journaled)”) because you’ll anyway format it with ext4. Then hit ‘Apply’ and that’s it.
Here is mine, though it’s already after I’ve installed Fedora.
First of all, you can’t use Fedora netinst image, because there is no working open source driver for Broadcom WiFi card that is installed in Macbook Air. So choose a full image that doesn’t require an internet connection like MATE or Gnome.
Now, you have to create a USB stick with Fedora. There is a tool called "Fedora Media Writer" that will make a bootable stick on macOS but, unfortunately, I failed to boot with it. It seems that after repartitioning, macOS immediately mounts the new partitions and touches them, making them somehow unusable for installation.
So I’ve created USB stick on Linux with simple
$ dd if=Fedora-Workstation-netinst-x86_64-25-1.3.iso of=/dev/sdd bs=1M oflag=direct
Now for the installation part.
Insert USB into Macbook, hold “alt” key and press power button still holding “alt” key until you see boot choice menu with Fedora.
After booting from USB you’ll see usual Anaconda installer. First and most important we must configure installation destination.
Enter this menu, choose “ATA APPLE SSD” and then choose “I will configure partitioning” and click “Done” in the top of the window.
Expand “Unknown” widget, find your 80 GBs or 74 GiBs partition of type “hfs+” and delete it. Now you’ll see 74 GiBs of available space in the pink rectangle at the bottom.
Now choose the "Standard Partition" scheme from the dropdown menu in the "New Fedora 25 Installation" widget, and then click on the link "Click here to create them automatically".
It will create separate / and /home partitions and also a whopping 8 GB swap. You can tweak the automatically created scheme to your taste, just don't touch the "/boot/efi" partition or it won't boot. I've changed the swap size to 2 GB, removed the /home and / partitions and manually added a / partition spanning all the available space of almost 80 GB.
Also, I set up LUKS encryption for my partitions, because it's a laptop after all - if I lose it, nobody will be able to steal my stuff by directly connecting to the SSD drive. And LUKS encryption doesn't impose any noticeable performance penalty.
Then hit “Done” and confirm your disk layout.
Now when you have partitioning configured, just setup your installation with Anaconda.
To make hardware features like brightness control and lid close/open work nicely, install some DE - MATE in my case. DEs have decent udev rules and configs for hardware. Installing one also sets up a display manager (the one that asks for the login and password) and the X server. It’s amazing how everything works out of the box. Something like 5 years ago it was a pain to make the mic and brightness work, and now you just don’t worry. Kudos to the distro and DE guys!
You can stick with MATE but I’ll install and configure i3 window manager over MATE.
Then reboot into your fresh Fedora by holding the “alt” key.
Macbook Air has crappy proprietary Broadcom WiFi chips. To make it work you’ll need an alternative network. You can use USB to Ethernet cable, or, as in my case, you can use your Android phone as a modem. No seriously, just attach your Android phone, select Modem mode and you’ll immediately see the network connected.
Now, when you have a network, to install Broadcom WiFi drivers open root terminal and do the following:
# Enable RPM fusion repo
dnf install https://download1.rpmfusion.org/free/fedora/rpmfusion-free-release-$(rpm -E %fedora).noarch.rpm https://download1.rpmfusion.org/nonfree/fedora/rpmfusion-nonfree-release-$(rpm -E %fedora).noarch.rpm
# Install packages
dnf install -y broadcom-wl akmods "kernel-devel-uname-r == $(uname -r)"
# Rebuild driver for your kernel
akmods
# Load the new driver
modprobe wl
After that, you’ll have WiFi working.
Now it’s time for tweaking. My favorite!
By default, the function keys act as multimedia keys. To revert them back to plain function keys we have to enable the so-called fn lock.
Create the file /etc/modprobe.d/hid_apple.conf as root and add the following to it:
options hid_apple fnmode=2
Don’t try to remove the hid_apple kernel module - your keyboard will stop working. Just reboot.
Infinality is a set of patches for fontconfig and freetype that makes fonts look gorgeous. I dare you to try it - after it, anything else will look like crap, including macOS fonts:
dnf copr enable caoli5288/infinality-ultimate
dnf install --allowerasing cairo-infinality-ultimate freetype-infinality-ultimate fontconfig-infinality-ultimate
Because Linux software is awesome and has text configs, I store most of them in Dropbox and restore my known and loved configuration by simply copying or symlinking.
Install headless Dropbox:
cd ~ && wget -O - "https://www.dropbox.com/download?plat=lnx.x86_64" | tar xzf -
And put dropbox CLI client to your ~/bin folder:
mkdir -p ~/bin && cd ~/bin && wget https://www.dropbox.com/download?dl=packages/dropbox.py
Now launch it with dropbox start.
Ok, so before that I was using MATE and while it’s nice I prefer tiling WM, namely i3. I install it with dnf:
dnf install i3
and then copy or symlink the ~/.i3 directory with the configuration from my Dropbox. But what is really awesome is that we can use i3 instead of MATE’s own window manager.
To change MATE’s window manager just issue these 2 commands under your user (no need for sudo):
dconf write /org/mate/desktop/session/required-components/windowmanager "'i3'"
dconf write /org/mate/desktop/session/required-components-list "['windowmanager']"
Logout and login and you’ll have it!
To exit from i3 as a window manager for MATE, use this in your i3 config
bindsym $mod+Shift+q exec "mate-session-save --logout"
Everything else I configure with mate-control-center.
So the hardest parts of installing Fedora on a Macbook Air are the partitioning and the WiFi driver. Everything else just works!
After using this setup for a couple of months I can say that it’s great. There are things that I wish could be better, but it’s mostly about the hardware: the screen is a crappy 1440x900 and the keyboard is way too limited (no separate home/end, you have to use fn+left/right). I would rather use some lightweight Thinkpad. But anyway, the freedom to take your workspace with you is amazing, so I think I’ll never buy a desktop machine again.
Most of the time you can get away with a shallow understanding of pointers. Indeed, even in production code you rarely see anything other than taking a pointer from malloc and passing it to some functions. And that’s where you get caught on C programming interview questions, because people love to ask tricky pointer questions: write a function to reverse a linked list, or do an in-order traversal of a binary tree.
I actually failed one interview back in 2012 because I couldn’t write a function that reverses a linked list. Yeah, I was depressed. Since then I promised myself that I would figure out how this shit really works. So this is my pointers epiphany post.
I think that the key to solving any pointer problem is to draw it correctly. Let me show you an example with a linked list, because it has a lot of pointers:
Each element is 2 squares - one for the “payload” variable and another for the pointer variable. The last pointer value is, of course, NULL. The head of the list is a pointer, and it’s drawn in a “box” like any other variable.
It’s of paramount importance to draw pointers in boxes like any other variables, showing with an arrow where the pointer points, because this representation will help you understand pointer code.
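The post never shows the node type itself, so here is a minimal sketch of the struct behind those boxes; the n payload field matches the snippets below, and the list_push helper is just a hypothetical way to build such a list.
#include <stdio.h>
#include <stdlib.h>

/* One list element: a "payload" box and a "pointer" box. */
struct list {
    int n;              /* payload */
    struct list *next;  /* pointer to the next element, NULL at the end */
};

/* Hypothetical helper: push a new element to the head of the list. */
struct list *list_push(struct list *head, int n)
{
    struct list *node = malloc(sizeof(*node));
    node->n = n;
    node->next = head;  /* the new node points to the old head */
    return node;        /* the new node becomes the head */
}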
For example, here is the code to iterate over a linked list:
struct list *cur = head;
while (cur) {
printf("cur is %p, val is %d\n", cur, cur->n);
cur = cur->next;
}
You can kind of understand it by intuition, but do you really understand why and how cur = cur->next works? Draw a picture!
cur = cur->next does its magic because the arrow operator in C translates to this: cur = (*cur).next. First, you dereference the pointer - that gives you the value under the pointer. Second, you take the value of its next field. Third, you copy that value into cur. This is how it lets you jump along the pointers.
If it doesn’t click, don’t worry. Take your time, draw it yourself and let it sink in.
Now that this seems easy, let’s look at the double pointer, or pointer to pointer.
Here is the same iteration but with double pointers:
struct list *cur;
struct list **pp = &head;
while (*pp) {
    cur = *pp;
    printf("cur is %p, val is %d\n", cur, cur->n);
    pp = &(cur->next);
}
And here is the representation of it:
Double pointers are useful because they allow you to change the underlying pointer and value. Here is the illustration of why it’s possible:
Note that *pp is a pointer, but it’s a different “box” than pp: pp points to the pointer, while *pp points to the value.
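The original illustration is an image; as a rough stand-in, here is a tiny sketch (using the struct list from above) of how assigning through *pp rewrites the pointer it refers to - here, head itself.
/* pp points at the head pointer, so writing through *pp changes head. */
void drop_head(struct list **pp)
{
    struct list *old = *pp;  /* the node head currently points to */
    *pp = old->next;         /* head now points to the second node */
    free(old);
}

/* Usage sketch: drop_head(&head); head is updated inside the function. */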
All of this may not sound useful at first, but without double pointers some code is much harder to read, and some is not even possible.
Take, for example, the task of removing an element from a linked list. You have to iterate over the list to find the element to delete, and then you have to delete it. Deleting an element from a linked list is an update of the adjacent pointers. This includes the head pointer, because you may need to remove the first element.
If you iterate over elements with a simple pointer, like in my first example, you have to keep cur and prev pointers so you can route the previous node’s pointer around the deleted element. That’s OK, but you also need a special case for removing the head, because head must be updated. Here is the code:
void list_remove(int i, struct list **head)
{
    struct list *cur = *head;
    struct list *prev = NULL;
    struct list *next;
    while (cur) {
        next = cur->next;
        if (cur->n == i) {
            if (prev) {
                // Route the previous pointer around the deleted element
                prev->next = next;
            } else {
                // prev == NULL means we are removing the head,
                // so shift head to the next element.
                *head = next;
            }
            free(cur);
        } else {
            // Only advance prev when the current node survived
            prev = cur;
        }
        // Iterating...
        cur = next;
    }
}
It works but seems a bit complicated - it requires comments explaining what’s happening here. With double pointers it looks like a breeze:
void list_remove_pp(int i, struct list **head)
{
    struct list **pp;
    struct list *cur;
    pp = head;
    while (*pp) {
        cur = *pp;
        if (cur->n == i) {
            // Rewire whatever pointed at cur (head or a next field)
            *pp = cur->next;
            free(cur);
        } else {
            pp = &(cur->next);
        }
    }
}
Because we use double pointers, we don’t have a special case for the head - with pp we can modify it just like any other pointer in the list.
So the next time you find yourself struggling with some pointer problem, draw a picture showing pointers as ordinary variables, and you’ll find the answer.
Just remember, there is no magic here - a pointer is just an ordinary variable, but you work with it in an unusual way.
The relative advantages of linked lists over static arrays include:
• Overflow on linked structures can never occur unless the memory is actually full
• Insertions and deletions are simpler than for contiguous (array) lists.
• With large records, moving pointers is easier and faster than moving the items themselves.
while the relative advantages of arrays include:
• Linked structures require extra space for storing pointer fields.
• Linked lists do not allow efficient random access to items.
• Arrays allow better memory locality and cache performance than random pointer jumping.
Mr. Skiena gives a comprehensive comparison but unfortunately doesn’t stress the last point enough. As a systems programmer, I know that memory access patterns, effective caching and exploiting CPU pipelines can be a game changer, and I would like to illustrate that here.
Let’s make a simple test and compare the performance of linked list and dynamic array data structures on basic operations like inserting and searching.
I’ll use Java as a perfect computer science playground tool. In Java, we have LinkedList and ArrayList - classes that implement a linked list and a dynamic array respectively, and both implement the same List interface.
Our tests will include: list allocation, random insertions, insertions to the head and to the tail, random search, and deletion.
The sources are at my CS playground in the ds/list-perf dir. There is a Maven project, so you can just do mvn package and get a jar. The tests are quite simple; for example, here is the random insertion test:
package com.dzyoba.alex;
import java.util.List;
import java.util.Random;
public class TestInsert implements Runnable {
private List<Integer> list;
private int listSize;
private int randomOps;
public TestInsert(List<Integer> list, int randomOps) {
this.list = list;
this.randomOps = randomOps;
}
public void run() {
int index, element;
int listSize = list.size();
Random randGen = new Random();
for (int i = 0; i < randomOps; i++) {
index = randGen.nextInt(listSize);
element = randGen.nextInt(listSize);
list.add(index, element);
}
}
}
It works through the List interface (yay, polymorphism!), so we can pass LinkedList and ArrayList without changing anything. It runs the tests in the order mentioned above (allocation -> insertions -> search -> delete) several times and calculates the min/median/max of all test results.
Alright, enough words, let’s run it!
$ time java -cp target/TestList-1.0-SNAPSHOT.jar com.dzyoba.alex.TestList
Testing LinkedList
Allocation: 7/22/442 ms
Insert: 9428/11125/23574 ms
InsertHead: 0/1/3 ms
InsertTail: 0/1/2 ms
Search: 25069/27087/50759 ms
Delete: 6/7/13 ms
------------------
Testing ArrayList
Allocation: 6/8/29 ms
Insert: 1676/1761/2254 ms
InsertHead: 4333/4615/5855 ms
InsertTail: 0/0/2 ms
Search: 9321/9579/11140 ms
Delete: 0/1/5 ms
real 10m31.750s
user 10m36.737s
sys 0m1.011s
You can see with the naked eye that LinkedList loses. But let me show you some nice box plots:
And here is the link to all tests combined
In all operations, LinkedList sucks horribly. The only exception is the insert to the head, but that’s playing against the worst case of a dynamic array – it has to copy the whole array every time.
To explain this, we’ll dive a little bit into implementation. I’ll use OpenJDK sources of Java 8.
So, the ArrayList and LinkedList sources are in src/share/classes/java/util. LinkedList in Java is implemented as a doubly-linked list via the Node inner class:
private static class Node<E> {
E item;
Node<E> next;
Node<E> prev;
Node(Node<E> prev, E element, Node<E> next) {
this.item = element;
this.next = next;
this.prev = prev;
}
}
Now, let’s look at what’s happening under the hood in the simple allocation test.
for (int i = 0; i < listSize; i++) {
list.add(i);
}
It invokes the add method, which invokes the linkLast method in the JDK:
public boolean add(E e) {
linkLast(e);
return true;
}
void linkLast(E e) {
final Node<E> l = last;
final Node<E> newNode = new Node<>(l, e, null);
last = newNode;
if (l == null)
first = newNode;
else
l.next = newNode;
size++;
modCount++;
}
Essentially, allocation in LinkedList is a constant time operation. The LinkedList class maintains a tail pointer, so to insert it just has to allocate a new object and update 2 pointers. It shouldn’t be that slow! So why does it happen? Let’s compare with ArrayList.
public boolean add(E e) {
ensureCapacityInternal(size + 1); // Increments modCount!!
elementData[size++] = e;
return true;
}
private void ensureCapacityInternal(int minCapacity) {
if (elementData == EMPTY_ELEMENTDATA) {
minCapacity = Math.max(DEFAULT_CAPACITY, minCapacity);
}
ensureExplicitCapacity(minCapacity);
}
private void ensureExplicitCapacity(int minCapacity) {
modCount++;
// overflow-conscious code
if (minCapacity - elementData.length > 0)
grow(minCapacity);
}
private void grow(int minCapacity) {
// overflow-conscious code
int oldCapacity = elementData.length;
int newCapacity = oldCapacity + (oldCapacity >> 1);
if (newCapacity - minCapacity < 0)
newCapacity = minCapacity;
if (newCapacity - MAX_ARRAY_SIZE > 0)
newCapacity = hugeCapacity(minCapacity);
// minCapacity is usually close to size, so this is a win:
elementData = Arrays.copyOf(elementData, newCapacity);
}
ArrayList in Java is, indeed, a dynamic array that grows its size by 1.5x on each grow, with an initial capacity of 10. Also, this //overflow-conscious code comment is actually pretty funny. You can read why that is so here.
The resizing itself is done via Arrays.copyOf, which calls System.arraycopy, which is a Java native method. The implementation of native methods is not part of the JDK; it’s specific to the JVM. Let’s grab the Hotspot source code and look into it.
Long story short - it’s in the TypeArrayKlass::copy_array method, which invokes Copy::conjoint_memory_atomic. This one checks alignment - there are variants for long, int, short and byte (unaligned) copies. We’ll look at the plain int variant - conjoint_jints_atomic, which is a wrapper for pd_conjoint_jints_atomic. This one is OS- and CPU-specific. Looking at the Linux variant, we’ll find a call to _Copy_conjoint_jints_atomic. And that last one is an assembly beast!
# Support for void Copy::conjoint_jints_atomic(void* from,
# void* to,
# size_t count)
# Equivalent to
# arrayof_conjoint_jints
.p2align 4,,15
.type _Copy_conjoint_jints_atomic,@function
.type _Copy_arrayof_conjoint_jints,@function
_Copy_conjoint_jints_atomic:
_Copy_arrayof_conjoint_jints:
pushl %esi
movl 4+12(%esp),%ecx # count
pushl %edi
movl 8+ 4(%esp),%esi # from
movl 8+ 8(%esp),%edi # to
cmpl %esi,%edi
leal -4(%esi,%ecx,4),%eax # from + count*4 - 4
jbe ci_CopyRight
cmpl %eax,%edi
jbe ci_CopyLeft
ci_CopyRight:
cmpl $32,%ecx
jbe 2f # <= 32 dwords
rep; smovl
popl %edi
popl %esi
ret
.space 10
2: subl %esi,%edi
jmp 4f
.p2align 4,,15
3: movl (%esi),%edx
movl %edx,(%edi,%esi,1)
addl $4,%esi
4: subl $1,%ecx
jge 3b
popl %edi
popl %esi
ret
ci_CopyLeft:
std
leal -4(%edi,%ecx,4),%edi # to + count*4 - 4
cmpl $32,%ecx
ja 4f # > 32 dwords
subl %eax,%edi # eax == from + count*4 - 4
jmp 3f
.p2align 4,,15
2: movl (%eax),%edx
movl %edx,(%edi,%eax,1)
subl $4,%eax
3: subl $1,%ecx
jge 2b
cld
popl %edi
popl %esi
ret
4: movl %eax,%esi # from + count*4 - 4
rep; smovl
cld
popl %edi
popl %esi
ret
The point is not that VM languages are slower, but that random memory access kills performance. The essence of conjoint_jints_atomic is rep; smovl1. And if there is something the CPU really loves, it is rep instructions.
For this, CPU can pipeline, prefetch, cache and do all the things it was built
for - streaming calculations and predictable memory access. Just read the
awesome “Modern Microprocessors. A 90 Minute Guide!”.
What this all means is that for the application rep smovl is not really a linear operation, but a somewhat constant one. Let’s illustrate this last point. For a list of 1 000 000 elements, let’s insert 100, 1000 and 10000 elements at the head of the list. On my machine I’ve got the following samples:
Each 10-fold increase in the number of operations results in the same 10-fold increase in time, because it’s effectively “10 * O(1)”.
Experienced developers are engineers, and they know that computer science is not software engineering. What’s good in theory might be wrong in practice because you don’t take into account all the factors. To succeed in the real world, knowledge of the underlying system and how it works is incredibly important and can be a game changer.
And it’s not only my opinion - a couple of years ago2 there was a link on Reddit - Bjarne Stroustrup: Why you should avoid LinkedLists. And I agree with his points. But, of course, be sane, don’t blindly trust anyone or anything - measure, measure, measure.
And here I would like to leave you with my all-time favorite, “The Night Watch” by James Mickens.
There is more work than you might initially think, because it requires the initialization of x86 interrupts: a quirky and tricky x86 ritual with 40 years of legacy behind it.
Interrupts are events sent from devices to the CPU signaling that a device has something to tell, like user input on the keyboard or the arrival of a network packet. Without interrupts you would have to poll all your peripherals, thus wasting CPU time, introducing latency and being a horrible person.
There are 3 sources or types of interrupts:
• Hardware interrupts - generated by external devices (keyboard, timer, network card) and delivered to the CPU through the interrupt controller.
• Software interrupts - raised explicitly with the int instruction. Before the introduction of SYSENTER/SYSEXIT, system call invocation was implemented via the software interrupt int $0x80.
• Exceptions - generated by the CPU itself when it detects an error during execution, like division by zero or a page fault.
The x86 interrupt system is tripartite in the sense that it involves 3 parts working conjointly: the interrupt controller (PIC), the interrupt descriptor table (IDT) and the interrupt handlers (ISRs) themselves.
Here is the reference figure, check it as you read through the article
Before proceeding to configure interrupts we must have GDT setup as we did before.
The PIC is the piece of hardware that various peripheral devices are connected to instead of the CPU. Being essentially a multiplexer/proxy, it saves CPU pins and provides several nice features, such as masking individual interrupt lines (as opposed to disabling all interrupts with cli) and prioritizing interrupts.
Original IBM PCs had a separate 8259 PIC chip. Later it was integrated as part of the southbridge/ICH/PCH. Modern PC systems have an APIC (advanced programmable interrupt controller) that solves interrupt routing problems for multi-core/multi-processor machines. But for backward compatibility, the APIC emulates the good ol’ 8259 PIC. So unless you’re on truly ancient hardware, you actually have an APIC that is configured in some way by you or the BIOS. In this article, I will rely on the BIOS configuration and will not configure the PIC, for 2 reasons. First, it’s a shitload of quirks that are impossible for a sensible human to figure out, and second, later we will configure APIC mode for SMP. The BIOS will configure the APIC as in an IBM PC AT machine, i.e. 2 PICs with 15 lines.
Apart from the line for raising interrupts in the CPU, the PIC is connected to the CPU data bus. This bus is used to send the IRQ number from the PIC to the CPU and to send configuration commands from the CPU to the PIC. Configuration commands include PIC initialization (again, we won’t do this for now), IRQ masking, the End-Of-Interrupt (EOI) command and so on.
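As a rough sketch of what “sending a command” means in practice, here is the classic byte-to-an-I/O-port routine; the port number and the EOI byte (both 0x20 for the master PIC) match the values used later in this article.
#include <stdint.h>

/* Write one byte to an I/O port. */
static inline void outb(uint16_t port, uint8_t val)
{
    __asm__ volatile ("outb %0, %1" : : "a"(val), "Nd"(port));
}

#define PIC1_CMD 0x20  /* master PIC command port */
#define PIC_EOI  0x20  /* End-Of-Interrupt command */

/* Acknowledge an interrupt on the master PIC. */
static void pic_send_eoi(void)
{
    outb(PIC1_CMD, PIC_EOI);
}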
Interrupt descriptor table (IDT) is an x86 system table that holds descriptors for Interrupt Service Routines (ISRs) or simply interrupt handlers.
In real mode, there is an IVT (interrupt vector table), which is located at the fixed address 0x0 and contains “interrupt handler pointers” in the form of CS and IP register values. This is really inflexible and relies on segmented memory management, so since the 80286 there is an IDT for protected mode.
The IDT is a table in memory, created and filled by the OS, that is pointed to by the idtr system register, which is loaded with the lidt instruction. You can use the IDT only in protected mode. IDT entries contain gate descriptors - not just addresses of interrupt handlers (ISRs) in 32-bit form, but also flags and protection levels. IDT entries are descriptors that describe interrupt gates, and in this sense the IDT resembles the GDT and its segment descriptors. Just look at them:
The main part of the descriptor is the offset - essentially a pointer to an ISR within the code segment chosen by the segment selector. The latter consists of an index into the GDT, a table indicator (GDT or LDT) and a Requested Privilege Level (RPL). For interrupt gates, the selector is always the kernel code segment in the GDT; that is 0x08 - the first usable GDT entry (each entry is 8 bytes) - with RPL 0 and table indicator 0 (GDT).
Type specifies the gate type - task, trap or interrupt. For interrupt handlers, we’ll use an interrupt gate, because for an interrupt gate the CPU will clear the IF flag (as opposed to a trap gate), and the TSS won’t be used (as opposed to a task gate - we don’t have one yet).
So basically, you just fill the IDT with descriptors that differ only in the offset, where you put the address of the ISR function.
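For illustration only - not the article’s actual code - here is a sketch of what such a 32-bit interrupt-gate descriptor and the “fill the IDT” routine could look like, assuming the layout described above.
#include <stdint.h>

/* One IDT entry: an interrupt gate descriptor. */
struct idt_entry {
    uint16_t offset_low;   /* ISR address, bits 0..15 */
    uint16_t selector;     /* code segment selector in GDT, e.g. 0x08 */
    uint8_t  zero;         /* unused, always 0 */
    uint8_t  type_attr;    /* 0x8E: present, DPL=0, 32-bit interrupt gate */
    uint16_t offset_high;  /* ISR address, bits 16..31 */
} __attribute__((packed));

static struct idt_entry idt[256];

/* Entries differ only in the ISR address we put into the offset. */
static void idt_set_gate(int n, uint32_t isr_addr)
{
    idt[n].offset_low  = isr_addr & 0xFFFF;
    idt[n].selector    = 0x08;   /* kernel code segment */
    idt[n].zero        = 0;
    idt[n].type_attr   = 0x8E;
    idt[n].offset_high = (isr_addr >> 16) & 0xFFFF;
}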
The main purpose of the IDT is to store pointers to ISRs that will be automatically invoked by the CPU when an interrupt is received. The important thing here is that you can NOT control the invocation of an interrupt handler. Once you have configured the IDT and enabled interrupts (sti), the CPU will eventually pass control to your handler after some behind-the-curtain work. That “behind the curtain work” is important to know.
If an interrupt occurred in userspace (actually, in a different privilege level), the CPU does roughly the following:
1. Reads the kernel stack (SS and ESP) for the target privilege level from the TSS
2. Switches to that stack and pushes the old SS and ESP onto it
3. Pushes EFLAGS, CS and EIP
4. Pushes an error code for some exceptions
5. Clears the IF flag (for interrupt gates), disabling further interrupts
6. Loads CS and EIP from the interrupt gate and passes control to the ISR
If an interrupt occurred in kernel space, the CPU will not switch stacks, meaning that in kernel space an interrupt doesn’t have its own stack; instead, it uses the stack of the interrupted procedure. On x64 this may lead to stack corruption because of the red zone, which is why kernel code must be compiled with -mno-red-zone. I have a funny story about this.
When an interrupt occurs in kernel mode, the CPU will:
1. Push EFLAGS, CS and EIP onto the current stack
2. Push an error code for some exceptions
3. Clear the IF flag (for interrupt gates)
4. Load CS and EIP from the interrupt gate and pass control to the ISR
Note that these 2 cases differ in what is pushed onto the stack: EFLAGS, CS and EIP are always pushed, while an interrupt from userspace additionally pushes the old SS and ESP.
This means that when the interrupt handler begins, it has the following stack:
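Since the stack figure is an image, here is a rough C view of the frame the handler sees, assuming a 32-bit interrupt without an error code; the last two fields exist only when the CPU switched from userspace.
#include <stdint.h>

/* What lies on the stack when the ISR gets control (lowest address first). */
struct interrupt_frame {
    uint32_t eip;     /* return address in the interrupted code */
    uint32_t cs;      /* code segment of the interrupted code */
    uint32_t eflags;  /* saved flags */
    uint32_t esp;     /* old stack pointer - only on privilege change */
    uint32_t ss;      /* old stack segment - only on privilege change */
};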
Now, when the control is passed to the interrupt handler, what should it do?
Remember that the interrupt occurred in the middle of some code in userspace or even kernel space, so the first thing to do is to save the state of the interrupted procedure before proceeding to interrupt handling. The procedure state is defined by its registers, and there is a special instruction, pusha, that saves the general purpose registers onto the stack.
The next thing is to completely switch the environment for the interrupt handler in terms of segment registers. The CPU automatically switches CS, so the interrupt handler must reload the 4 data segment registers: DS, ES, FS and GS. And don’t forget to save and later restore the previous values.
After the state is saved and the environment is ready, the interrupt handler should do its work, whatever it is, but the first and most important thing to do is to acknowledge the interrupt by sending the special EOI command to the PIC.
Finally, after doing all its work, there should be a clean return from the interrupt that will restore the state of the interrupted procedure (restore the data segment registers, popa), enable interrupts (sti) that were disabled by the CPU before entering the ISR (the penultimate step of the CPU’s work) and call iret.
Here is the basic ISR algorithm:
1. Save the interrupted procedure state (pusha)
2. Save and reload the data segment registers (DS, ES, FS, GS)
3. Acknowledge the interrupt (send EOI to the PIC)
4. Do the actual interrupt handling work
5. Restore the data segment registers and the general purpose registers (popa)
6. Enable interrupts (sti)
7. Return with iret
Now, to complete the picture, let’s see how a keyboard press is handled:
1. Set up the IDT and load it with lidt
2. Send 0xfd (11111101) to PIC1 to unmask (enable) IRQ1
3. Enable interrupts with sti
4. The user hits a key; the keyboard controller raises IRQ1, which the PIC turns into interrupt vector 9 and signals to the CPU
5. The CPU looks into idtr and fetches the segment selector and handler offset from IDT descriptor 9, does its “behind the curtain” work and passes control to the ISR
6. The ISR disables interrupts with cli (just in case)
7. The ISR saves the interrupted state with pusha
8. The ISR sends the EOI command (0x20) to the master PIC (I/O port 0x20)
9. The ISR reads the keyboard controller status from port 0x64
10. The ISR reads the scancode from port 0x60
11. The ISR restores the state with popa
12. The ISR enables interrupts with sti
13. The ISR returns with iret
Note that this happens every time you hit a keyboard key. And don’t forget that there are a few dozen other interrupts, like clocks, network packets and such, that are handled seamlessly without you even noticing. Can you imagine how fast your hardware is? Can you imagine how well written your operating system is? Now think about it and give OS writers and hardware designers some well-deserved praise.
I have been writing my kernel for the last couple of months (on and off), and with the help of the OSDev wiki I’ve got quite a good kernel based on the meaty skeleton, and now I want to go further. But where to? My milestone is to make keyboard input work. This will require working interrupts, but that’s not the first thing to do.
According to the Multiboot specification, after the bootloader has passed control to our kernel, the machine is in a pretty reasonable state except for 3 things (quoting chapter 3.2 “Machine state”):
• ESP - the OS image must create its own stack as soon as it needs one
• GDTR - the GDTR may be invalid, so the OS image must not load any segment registers until it sets up its own GDT
• IDTR - the OS image must leave interrupts disabled until it sets up its own IDT
Setting up a stack is simple - you just put 2 labels separated by your stack size. In “hydra” it’s 16 KiB:
# Reserve a stack for the initial thread.
.section .bootstrap_stack, "aw", @nobits
stack_bottom:
.skip 16384 # 16 KiB
stack_top:
Next, we need to set up segmentation. We have to do this before setting up interrupts because each IDT gate descriptor must contain a segment selector for the destination code segment - the kernel code segment that we must set up.
Nevertheless, it will almost certainly work even without setting up the GDT, because the Multiboot bootloader sets one up by itself and we are left with its configuration, which usually sets up a flat memory model. For example, here is the GDT that legacy GRUB sets:
| Index | Base | Size | DPL | Info |
|---|---|---|---|---|
| 00 (Selector 0x0000) | 0x0 | 0xfff0 | 0 | Unused |
| 01 (Selector 0x0008) | 0x0 | 0xffffffff | 0 | 32-bit code |
| 02 (Selector 0x0010) | 0x0 | 0xffffffff | 0 | 32-bit data |
| 03 (Selector 0x0018) | 0x0 | 0xffff | 0 | 16-bit code |
| 04 (Selector 0x0020) | 0x0 | 0xffff | 0 | 16-bit data |
It’s fine for a kernel-only mode because it has 32-bit segments for code and data of size 2^32, but there are no segments with DPL=3, and there are also 16-bit code segments that we don’t want.
But really, it is just plain stupid to rely on undefined values, so we’ll set up segmentation ourselves.
Segmentation is a technique used in x86 CPUs to expand the amount of available memory. There are 2 different segmentation models depending on the CPU mode - the real-address model and the protected model.
Real mode is a 16-bit Intel 8086 CPU mode; it’s the mode the processor starts working in upon reset. With a 16-bit processor, you may address at most 2^16 = 64 KiB of memory, which even by 1978 standards was way too small. So Intel decided to extend the address space to 1 MiB and made the address bus 20 bits wide (2^20 = 1048576 bytes = 1 MiB). But you can’t address a 20-bit wide address space with 16-bit registers; you have to expand your registers by 4 bits. This is where segmentation comes in.
The idea of segmentation is to organize address space in chunks called segments, where your address from 16-bit register would be an offset in the segment.
With segmentation, you use 2 registers to address memory: segment register and general-purpose register representing offset. Linear address (the one that will be issued on the address bus of CPU) is calculated like this:
Linear address = Segment << 4 + Offset
Note that with this formula it’s up to you to choose the segment size. The only limitations are that segments start on 16-byte boundaries, implied by the 4-bit shift, and that a single segment covers at most 64 KiB, implied by the Offset size.
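To make the arithmetic concrete, here is a small sketch that computes the 20-bit linear address for a segment:offset pair, including the wrap-around at 1 MiB discussed below.
#include <stdint.h>
#include <stdio.h>

/* Real-mode address translation: (segment << 4) + offset, modulo 2^20. */
static uint32_t real_mode_linear(uint16_t seg, uint16_t off)
{
    return (((uint32_t)seg << 4) + off) & 0xFFFFF;
}

int main(void)
{
    printf("%05x\n", real_mode_linear(0x0002, 0x0005)); /* 00025 */
    printf("%05x\n", real_mode_linear(0xffff, 0x0035)); /* 00025, after wrap */
    return 0;
}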
In the example above we’ve used the logical address 0x0002:0x0005, which gave us the linear address 0x00025. In my example I’ve chosen to use 32-byte segments, but this is only my mental representation - how I choose to construct logical addresses. There are many ways to represent the same address with segmentation:
0x0000:0x0025 = 0x0 << 4 + 0x25 = 0x00 + 0x25 = 0x00025
0x0002:0x0005 = 0x2 << 4 + 0x05 = 0x20 + 0x05 = 0x00025
0xffff:0x0035 = 0xffff0 + 0x35 = 0x100025 = (Wrap around 20 bit) = 0x00025
0xfffe:0x0045 = 0xfffe0 + 0x45 = 0x100025 = (Wrap around 20 bit) = 0x00025
...
Note the wrap-around part. This is where it starts to get complicated, and it’s time to tell the fun story about Gate-A20.
On the Intel 8086, segment register loading was a slow operation, so some DOS programmers used the wrap-around trick to avoid it and speed up their programs. Placing the code in high addresses of memory (close to 1 MiB) and accessing data in lower addresses (I/O buffers) was possible without reloading the segment, thanks to the wrap-around.
Then Intel introduced the 80286 processor with a 24-bit address bus. The CPU started in real mode assuming a 20-bit address space, and then you could switch to protected mode and enjoy all 16 MiB of RAM available to your 24-bit addresses. But nobody forced you to switch to protected mode. You could still use your old programs written for real mode. Unfortunately, the 80286 processor had a bug - in real mode it didn’t zero out the 21st address line - the A20 line (counting from A0). So the wrap-around trick no longer worked. All those tricky speedy DOS programs were broken!
IBM, which was selling PC/AT computers with the 80286, fixed this by inserting a logic gate on the A20 line between the CPU and the system bus that could be controlled from software. On reset, the BIOS enables the A20 line to count system memory and then disables it again before passing control to the operating system, thus enabling the wrap-around trick. Yay! Read more shenanigans about A20 here.
So, from then on, all x86 and x86_64 PCs have this Gate-A20. Enabling it is one of the required steps for switching into protected mode.
Needless to say, a Multiboot compatible bootloader enables it and switches into protected mode before passing control to the kernel.
As you might have seen in the previous section, segmentation is an awkward and error-prone mechanism for memory organization and protection. Intel understood this quickly and in the 80386 introduced paging - a flexible and powerful system for real memory management. Paging is available only in protected mode - the successor of real mode introduced in the 80286, which provides new segmentation features like segment limit checking, read-only and execute-only segments and 4 privilege levels (CPU rings).
Although paging is the mechanism for memory management, when operating in protected mode all memory references are still subject to segmentation for the sake of backward compatibility. And it drastically differs from segmentation in real mode.
In protected mode, instead of a segment base, the segment register holds a segment selector - a value used to index a table of segments called the Global Descriptor Table (GDT). This selector chooses an entry in the GDT called a segment descriptor. A segment descriptor is an 8-byte structure that contains the base address of the segment and various fields used for various design choices, howsoever exotic they are.
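For clarity, here is a small sketch of how the 16-bit selector value is laid out; the field split is the standard x86 one, and the decode helper is purely illustrative.
#include <stdint.h>

/* Pieces of a protected-mode segment selector. */
struct selector_fields {
    unsigned rpl;    /* bits 0..1:  requested privilege level */
    unsigned ti;     /* bit  2:     table indicator, 0 = GDT, 1 = LDT */
    unsigned index;  /* bits 3..15: descriptor index in the table */
};

static struct selector_fields decode_selector(uint16_t sel)
{
    struct selector_fields f;
    f.rpl   = sel & 0x3;
    f.ti    = (sel >> 2) & 0x1;
    f.index = sel >> 3;
    return f;
}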
The GDT is located in memory (on an 8-byte boundary) and is pointed to by the gdtr register.
All memory operations either explicitly or implicitly involve a segment register. The CPU uses the selector in the segment register to fetch the segment descriptor from the GDT, finds out the segment base address and adds the offset from the memory operand to it.
You can mimic a flat, unsegmented memory model by configuring fully overlapping segments. And actually, the absolute majority of operating systems do exactly this: they set up all segments to span from 0 to 4 GiB, fully overlapping, and leave memory management to paging.
First of all, let’s make one thing clear - there is a lot of stuff here. When I was reading the Intel system programming manual, my head started hurting. And actually, you don’t need all of it, because it’s segmentation: you just want to set it up so it works and prepares the system for paging.
In most cases, you will need at least 4 segments:
• kernel code segment
• kernel data segment
• userspace code segment
• userspace data segment
This structure is not only sane but also required if you want to use SYSCALL/SYSRET - the fast system call mechanism without the CPU exception overhead of int 0x80.
These 4 segments are “non-system” segments, as defined by the S flag in the segment descriptor. You use such segments for normal code and data, both for the kernel and for userspace. There are also “system” segments that have a special meaning for the CPU. Intel CPUs support 6 system descriptor types, of which you should have at least one Task-State Segment (TSS) for each CPU (core) in the system. The TSS is used to implement multitasking, and I’ll cover it in later articles.
The four segments that we set up differ in their flags. Code segments are execute/read-only, while data segments are read/write. Kernel segments differ from userspace ones by DPL - the descriptor privilege level. Privilege levels form the CPU protection rings. Intel CPUs have 4 rings, where 0 is the most privileged and 3 is the least privileged.
CPU rings are a way to protect privileged code, such as the operating system kernel, from direct access by wild userspace. Usually, you create kernel segments in ring 0 and userspace segments in ring 3. It’s not that it’s impossible to access kernel code from userspace - it is possible, but only through a well-defined mechanism, controlled by the kernel, that involves (among other things) a switch from ring 3 to ring 0.
Besides the DPL (Descriptor Privilege Level), which is stored in the segment descriptor itself, there are also the CPL (Current Privilege Level) and the RPL (Requested Privilege Level). The CPL is stored in the CS and SS segment registers. The RPL is encoded in the segment selector. Before loading a segment selector into a segment register, the CPU performs a privilege check using this formula:
MAX(CPL, RPL) <= DPL
Because the RPL is under the calling software’s control, it could be used to tamper with privileged software. To prevent this, the CPL is also used in the access check.
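A small sketch of that check, remembering that a lower ring number means more privilege:
/* Effective privilege is the numerically larger (less privileged) of CPL and RPL;
   access is allowed only if it does not exceed the target descriptor's DPL. */
static int segment_access_allowed(int cpl, int rpl, int dpl)
{
    int effective = (cpl > rpl) ? cpl : rpl;
    return effective <= dpl;
}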
Let’s look at how control is transferred between code segments. We will look into the simplest case of control transfer - a far jmp/call; the special SYSENTER/SYSEXIT instructions, interrupts/exceptions and task switching are another topic.
Far jmp/call instructions, in contrast to near jmp/call, contain a segment selector as part of the operand. Here are examples:
jmp eax ; Near jump (indirect, within the current segment)
jmp 0x10:0x1000 ; Far jump (selector 0x10, offset 0x1000)
When you issue a far jmp/call, the CPU takes the CPL from CS, the RPL from the segment selector encoded in the far instruction operand, and the DPL from the target segment descriptor found by the index in the segment selector. Then it performs the privilege check. If it was successful, the segment selector is loaded into the segment register. From now on you’re in a new segment, and EIP is an offset within this segment. The called procedure executes on its own stack: each privilege level has its own stack. The ring 3 stack is pointed to by the SS and ESP registers, while the stacks for privilege levels 2, 1 and 0 are stored in the TSS.
Finally, let’s see how it’s all working.
As you might have seen, things got more complicated, and the conversion from a logical to a linear address (without paging it’ll be the physical address) now goes like this:
1. The CPU takes the segment selector from the segment register (or from the far jmp/call operand)
2. If this is a segment switch, the CPU performs the privilege check MAX(CPL, RPL) <= DPL
3. If the check fails, the CPU raises a #GP exception (General Protection Fault)
4. Otherwise, it fetches the segment descriptor from the GDT, takes the segment base address from it and adds the offset, producing the linear address
Note that without segment switching, address translation is pretty straightforward: take the base address and add the offset. Segment switching is a real pain, so most operating systems avoid it and set up just 4 segments - the minimum needed to please the CPU and protect the kernel from userspace.
Linux kernel describes segment descriptor as desc_struct structure in arch/x86/include/asm/desc_defs.h
struct desc_struct {
union {
struct {
unsigned int a;
unsigned int b;
};
struct {
u16 limit0;
u16 base0;
unsigned base1: 8, type: 4, s: 1, dpl: 2, p: 1;
unsigned limit: 4, avl: 1, l: 1, d: 1, g: 1, base2: 8;
};
};
} __attribute__((packed));
#define GDT_ENTRY_INIT(flags, base, limit) { { { \
.a = ((limit) & 0xffff) | (((base) & 0xffff) << 16), \
.b = (((base) & 0xff0000) >> 16) | (((flags) & 0xf0ff) << 8) | \
((limit) & 0xf0000) | ((base) & 0xff000000), \
} } }
GDT itself defined in arch/x86/kernel/cpu/common.c:
.gdt = {
[GDT_ENTRY_KERNEL_CS] = GDT_ENTRY_INIT(0xc09a, 0, 0xfffff),
[GDT_ENTRY_KERNEL_DS] = GDT_ENTRY_INIT(0xc092, 0, 0xfffff),
[GDT_ENTRY_DEFAULT_USER_CS] = GDT_ENTRY_INIT(0xc0fa, 0, 0xfffff),
[GDT_ENTRY_DEFAULT_USER_DS] = GDT_ENTRY_INIT(0xc0f2, 0, 0xfffff),
...
Basically, there is a flat memory model with 4 segments from 0
to 0xfffff * granularity
, where granularity flag set to 1 specifies 4096 increments, thus
giving us the limit of 4 GiB. Userspace and kernel segments differ in DPL only.
In the Linux version 0.01, there were no userspace segments. In boot/head.s
_gdt: .quad 0x0000000000000000 /* NULL descriptor */
.quad 0x00c09a00000007ff /* 8Mb */
.quad 0x00c09200000007ff /* 8Mb */
.quad 0x0000000000000000 /* TEMPORARY - don't use */
.fill 252,8,0 /* space for LDT's and TSS's etc */
Unfortunately, I wasn’t able to track down how userspace was set up (TSS only?).
NetBSD kernel defines 4 segments as everybody. In sys/arch/i386/include/segments.h
#define GNULL_SEL 0 /* Null descriptor */
#define GCODE_SEL 1 /* Kernel code descriptor */
#define GDATA_SEL 2 /* Kernel data descriptor */
#define GUCODE_SEL 3 /* User code descriptor */
#define GUDATA_SEL 4 /* User data descriptor */
...
Segments are set up in
sys/arch/i386/i386/machdep.c,
function initgdt
:
setsegment(&gdt[GCODE_SEL].sd, 0, 0xfffff, SDT_MEMERA, SEL_KPL, 1, 1);
setsegment(&gdt[GDATA_SEL].sd, 0, 0xfffff, SDT_MEMRWA, SEL_KPL, 1, 1);
setsegment(&gdt[GUCODE_SEL].sd, 0, x86_btop(I386_MAX_EXE_ADDR) - 1,
SDT_MEMERA, SEL_UPL, 1, 1);
setsegment(&gdt[GUCODEBIG_SEL].sd, 0, 0xfffff,
SDT_MEMERA, SEL_UPL, 1, 1);
setsegment(&gdt[GUDATA_SEL].sd, 0, 0xfffff,
SDT_MEMRWA, SEL_UPL, 1, 1);
Where setsegment
has following
signature:
void
setsegment(struct segment_descriptor *sd, const void *base, size_t limit,
int type, int dpl, int def32, int gran)
OpenBSD is similar to NetBSD, but the segment order is different. In sys/arch/i386/include/segments.h:
/*
* Entries in the Global Descriptor Table (GDT)
*/
#define GNULL_SEL 0 /* Null descriptor */
#define GCODE_SEL 1 /* Kernel code descriptor */
#define GDATA_SEL 2 /* Kernel data descriptor */
#define GLDT_SEL 3 /* Default LDT descriptor */
#define GCPU_SEL 4 /* per-CPU segment */
#define GUCODE_SEL 5 /* User code descriptor (a stack short) */
#define GUDATA_SEL 6 /* User data descriptor */
...
As you can see, the userspace code and data segments are at positions 5 and 6 in the GDT. I don’t know how SYSENTER/SYSEXIT will work here, because you set the value of the SYSENTER segment in the IA32_SYSENTER_CS MSR and all other segments are calculated as offsets from it, e.g. the SYSEXIT target code segment is at a 16-byte offset - the GDT entry after the next one from the SYSENTER segment. In this case, SYSEXIT would hit the LDT entry. Some help from OpenBSD kernel folks would be great here. Everything else is the same.
xv6 is a re-implementation of Dennis Ritchie’s and Ken Thompson’s Unix Version 6 (v6). It’s a small operating system that is taught at MIT.
It’s really pleasant to read its source code. There is a main in main.c that calls seginit in vm.c.
This function sets up 6 segments:
#define SEG_KCODE 1 // kernel code
#define SEG_KDATA 2 // kernel data+stack
#define SEG_KCPU 3 // kernel per-cpu data
#define SEG_UCODE 4 // user code
#define SEG_UDATA 5 // user data+stack
#define SEG_TSS 6 // this process's task state
like this
// Map "logical" addresses to virtual addresses using identity map.
// Cannot share a CODE descriptor for both kernel and user
// because it would have to have DPL_USR, but the CPU forbids
// an interrupt from CPL=0 to DPL=3.
c = &cpus[cpunum()];
c->gdt[SEG_KCODE] = SEG(STA_X|STA_R, 0, 0xffffffff, 0);
c->gdt[SEG_KDATA] = SEG(STA_W, 0, 0xffffffff, 0);
c->gdt[SEG_UCODE] = SEG(STA_X|STA_R, 0, 0xffffffff, DPL_USER);
c->gdt[SEG_UDATA] = SEG(STA_W, 0, 0xffffffff, DPL_USER);
// Map cpu, and curproc
c->gdt[SEG_KCPU] = SEG(STA_W, &c->cpu, 8, 0);
Four segments for kernel and userspace code and data, one for per-CPU data and one for the TSS - nice and simple code, clear logic, a great OS for education.
SystemTap is a profiling and debugging infrastructure based on kprobes. Essentially, it’s a scripting facility for kprobes. It allows you to dynamically instrument the kernel and user applications to track down complex and obscure problems in system behavior.
With SystemTap you write a tapscript in a special language inspired by C, awk and dtrace. The SystemTap language asks you to write handlers for probes defined in the kernel or userspace that will be invoked when execution hits these probes. You can define your own functions and use the extensive tapsets library. The language provides you with integers, strings, associative arrays and statistics, without requiring type declarations or memory allocation. Comprehensive information about the SystemTap language can be found in the language reference.
The scripts that you write are “elaborated” (resolving references to tapsets and to kernel and userspace symbols), translated to C, wrapped with kprobes API invocations and compiled into a kernel module that, finally, is loaded into the kernel.
Script output and other collected data are transferred from the kernel to userspace via a high-performance transport like relayfs or netlink.
The installation part is boring and depends on your distro; on Fedora, it’s as simple as:
$ dnf install systemtap
You will need SystemTap runtime and client tools along with tapsets and other development files for building your modules.
Also, you will need kernel debug info:
$ dnf debuginfo-install kernel
After installation, you may check if it’s working:
$ stap -v -e 'probe begin { println("Started") }'
Pass 1: parsed user script and 592 library scripts using 922624virt/723440res/7456shr/715972data kb, in 3250usr/220sys/3577real ms.
Pass 2: analyzed script: 1 probe, 0 functions, 0 embeds, 0 globals using 963940virt/765008res/7588shr/757288data kb, in 320usr/10sys/338real ms.
Pass 3: translated to C into "/tmp/stapMS0u1v/stap_804234031353467eccd1a028c78ff3e3_819_src.c" using 963940virt/765008res/7588shr/757288data kb, in 0usr/0sys/0real ms.
Pass 4: compiled C into "stap_804234031353467eccd1a028c78ff3e3_819.ko" in 9530usr/1380sys/11135real ms.
Pass 5: starting run.
Started
^CPass 5: run completed in 20usr/20sys/45874real ms.
Various examples of what SystemTap can do can be found here.
You can see call graphs with para-callgraph.stp:
$ stap para-callgraph.stp 'process("/home/avd/dev/block_hasher/block_hasher").function("*")' \
-c '/home/avd/dev/block_hasher/block_hasher -d /dev/md0 -b 1048576 -t 10 -n 10000'
0 block_hasher(10792):->_start
11 block_hasher(10792): ->__libc_csu_init
14 block_hasher(10792): ->_init
17 block_hasher(10792): <-_init
18 block_hasher(10792): ->frame_dummy
21 block_hasher(10792): ->register_tm_clones
23 block_hasher(10792): <-register_tm_clones
25 block_hasher(10792): <-frame_dummy
26 block_hasher(10792): <-__libc_csu_init
31 block_hasher(10792): ->main argc=0x9 argv=0x7ffc78849278
44 block_hasher(10792): ->bdev_open dev_path=0x7ffc78849130
88 block_hasher(10792): <-bdev_open return=0x163b010
0 block_hasher(10796):->thread_func arg=0x163b2c8
0 block_hasher(10797):->thread_func arg=0x163b330
0 block_hasher(10795):->thread_func arg=0x163b260
0 block_hasher(10798):->thread_func arg=0x163b398
0 block_hasher(10799):->thread_func arg=0x163b400
0 block_hasher(10800):->thread_func arg=0x163b468
0 block_hasher(10801):->thread_func arg=0x163b4d0
0 block_hasher(10802):->thread_func arg=0x163b538
0 block_hasher(10803):->thread_func arg=0x163b5a0
0 block_hasher(10804):->thread_func arg=0x163b608
407360 block_hasher(10799): ->time_diff start={...} end={...}
407371 block_hasher(10799): <-time_diff
407559 block_hasher(10799):<-thread_func return=0x0
436757 block_hasher(10795): ->time_diff start={...} end={...}
436765 block_hasher(10795): <-time_diff
436872 block_hasher(10795):<-thread_func return=0x0
489156 block_hasher(10797): ->time_diff start={...} end={...}
489163 block_hasher(10797): <-time_diff
489277 block_hasher(10797):<-thread_func return=0x0
506616 block_hasher(10803): ->time_diff start={...} end={...}
506628 block_hasher(10803): <-time_diff
506754 block_hasher(10803):<-thread_func return=0x0
526005 block_hasher(10801): ->time_diff start={...} end={...}
526010 block_hasher(10801): <-time_diff
526075 block_hasher(10801):<-thread_func return=0x0
9840716 block_hasher(10804): ->time_diff start={...} end={...}
9840723 block_hasher(10804): <-time_diff
9840818 block_hasher(10804):<-thread_func return=0x0
9857787 block_hasher(10802): ->time_diff start={...} end={...}
9857792 block_hasher(10802): <-time_diff
9857895 block_hasher(10802):<-thread_func return=0x0
9872655 block_hasher(10796): ->time_diff start={...} end={...}
9872664 block_hasher(10796): <-time_diff
9872816 block_hasher(10796):<-thread_func return=0x0
9875681 block_hasher(10798): ->time_diff start={...} end={...}
9875686 block_hasher(10798): <-time_diff
9874408 block_hasher(10800): ->time_diff start={...} end={...}
9874413 block_hasher(10800): <-time_diff
9875767 block_hasher(10798):<-thread_func return=0x0
9874482 block_hasher(10800):<-thread_func return=0x0
9876305 block_hasher(10792): ->bdev_close dev=0x163b010
10180742 block_hasher(10792): <-bdev_close
10180801 block_hasher(10792): <-main return=0x0
10180808 block_hasher(10792): ->__do_global_dtors_aux
10180814 block_hasher(10792): ->deregister_tm_clones
10180817 block_hasher(10792): <-deregister_tm_clones
10180819 block_hasher(10792): <-__do_global_dtors_aux
10180821 block_hasher(10792): ->_fini
10180823 block_hasher(10792): <-_fini
Pass 5: run completed in 20usr/3200sys/10716real ms.
You can find generic source of latency with latencytap.stp:
$ stap -v latencytap.stp -c \
'/home/avd/dev/block_hasher/block_hasher -d /dev/md0 -b 1048576 -t 10 -n 1000000'
Reason Count Average(us) Maximum(us) Percent%
Reading from file 490 49311 53833 96%
Userspace lock contention 8 118734 929420 3%
Page fault 17 27 65 0%
unmapping memory 4 37 55 0%
mprotect() system call 6 25 45 0%
4 19 37 0%
3 23 49 0%
Page fault 2 24 46 0%
Page fault 2 20 36 0%
Note: you may need to change timer interval in latencytap.stp:
-probe timer.s(30) {
+probe timer.s(5) {
There is even a 2048 game written in SystemTap!
All in all, it’s simple and convenient. You can wrap your head around it in a single day! And it works as you expect, which is a big deal because it gives you certainty and confidence on the infirm ground of profiling kernel problems.
So, how can we use it for profiling a kernel, a module or a userspace application? The thing is that we have almost unlimited power in our hands. We can do whatever we want and however we want, but we must know what we want and express it in the SystemTap language.
You have the tapsets – the standard library for SystemTap – which contain a massive variety of probes and functions available to your tapscripts.
But, let’s be honest, nobody wants to write scripts, everybody wants to use scripts written by someone who has the expertise and who already spent a lot of time, debugged and tweaked the script.
Let’s look at what we can find in SystemTap I/O examples.
There is one that seems legit: “ioblktime”. Let’s launch it:
stap -v ioblktime.stp -o ioblktime -c \
'/home/avd/dev/block_hasher/block_hasher -d /dev/md0 -b 1048576 -t 10 -n 10000'
Here’s what we’ve got:
device rw total (us) count avg (us)
ram4 R 101628 981 103
ram5 R 99328 981 101
ram6 R 64973 974 66
ram2 R 57002 974 58
ram3 R 66635 974 68
ram0 R 101806 974 104
ram1 R 98470 974 101
ram7 R 64250 974 65
dm-0 R 48337401 974 49627
sda W 3871495 376 10296
sda R 125794 14 8985
device rw total (us) count avg (us)
sda W 278560 18 15475
We see a strange device dm-0. Quick check:
$ dmsetup info /dev/dm-0
Name: delayed
State: ACTIVE
Read Ahead: 256
Tables present: LIVE
Open count: 1
Event number: 0
Major, minor: 253, 0
Number of targets: 1
It’s the DeviceMapper “delayed” target that we saw previously. This target creates a block device that maps identically to a disk but delays each request by a given number of milliseconds. This is the cause of our RAID problems: the performance of a striped RAID is the performance of its slowest disk.
I’ve looked for other examples, but they mostly show which process generates the most I/O.
Let’s try to write our own script!
There is a tapset dedicated to the I/O scheduler and block I/O. Let’s use probe::ioblock.end, match our RAID device and print a backtrace.
probe ioblock.end
{
if (devname == "md0") {
printf("%s: %d\n", devname, sector);
print_backtrace()
}
}
Unfortunately, this won’t work because RAID device requests end up on a concrete disk, so we have to hook into the raid0 module.
Dive into drivers/md/raid0.c and look at raid0_make_request.
The core of RAID 0 is encoded in these lines:
530 if (sectors < bio_sectors(bio)) {
531 split = bio_split(bio, sectors, GFP_NOIO, fs_bio_set);
532 bio_chain(split, bio);
533 } else {
534 split = bio;
535 }
536
537 zone = find_zone(mddev->private, &(sector));
538 tmp_dev = map_sector(mddev, zone, sector, &(sector));
539 split->bi_bdev = tmp_dev->bdev;
540 split->bi_iter.bi_sector = sector + zone->dev_start +
541 tmp_dev->data_offset;
...
548 generic_make_request(split);
which tell us: “split the bio request for the RAID md device, map it to a particular disk and issue generic_make_request”.
A closer look at generic_make_request
1966 do {
1967 struct request_queue *q = bdev_get_queue(bio->bi_bdev);
1968
1969 q->make_request_fn(q, bio);
1970
1971 bio = bio_list_pop(current->bio_list);
1972 } while (bio);
will show us that it gets the request queue from the block device - in our case, a particular disk - and issues make_request_fn.
This leads us to check which block devices our RAID consists of:
$ mdadm --misc -D /dev/md0
/dev/md0:
Version : 1.2
Creation Time : Mon Nov 30 11:15:51 2015
Raid Level : raid0
Array Size : 3989504 (3.80 GiB 4.09 GB)
Raid Devices : 8
Total Devices : 8
Persistence : Superblock is persistent
Update Time : Mon Nov 30 11:15:51 2015
State : clean
Active Devices : 8
Working Devices : 8
Failed Devices : 0
Spare Devices : 0
Chunk Size : 512K
Name : alien:0 (local to host alien)
UUID : d2960b14:bc29a1c5:040efdc6:39daf5cf
Events : 0
Number Major Minor RaidDevice State
0 1 0 0 active sync /dev/ram0
1 1 1 1 active sync /dev/ram1
2 1 2 2 active sync /dev/ram2
3 1 3 3 active sync /dev/ram3
4 1 4 4 active sync /dev/ram4
5 1 5 5 active sync /dev/ram5
6 1 6 6 active sync /dev/ram6
7 253 0 7 active sync /dev/dm-0
and here we go – the last device is our strange /dev/dm-0.
And again, I knew it from the beginning and tried to get to the root of the problem with SystemTap. But SystemTap was just a motivation to look into the kernel code and think deeper, which is nice, though. This again proves that the best tool to investigate any problem, be it a performance issue or a bug, is your brain and experience.
For the illustrations I’m gonna use a “Hello world” kernel written in NASM assembly (grab the source from github):
global start ; the entry symbol for ELF
MAGIC_NUMBER equ 0x1BADB002 ; define the magic number constant
FLAGS equ 0x0 ; multiboot flags
CHECKSUM equ -MAGIC_NUMBER ; calculate the checksum
; (magic number + checksum + flags should equal 0)
section .text: ; start of the text (code) section
align 4 ; the code must be 4 byte aligned
dd MAGIC_NUMBER ; write the magic number to the machine code,
dd FLAGS ; the flags,
dd CHECKSUM ; and the checksum
start: ; the loader label (defined as entry point in linker script)
mov ebx, 0xb8000 ; VGA area base
mov ecx, 80*25 ; console size
; Clear screen
mov edx, 0x0020; space symbol (0x20) on black background
clear_loop:
mov [ebx + ecx], edx
dec ecx
cmp ecx, -1
jnz clear_loop
; Print red 'A'
mov eax, ( 4 << 8 | 0x41) ; 'A' symbol (0x41) print in red (0x4)
mov [ebx], eax
.loop:
jmp .loop ; loop forever
This kernel works with the VGA buffer - it clears the screen of old BIOS messages and prints a capital ‘A’ in red. After that, it just loops forever.
Compile it with
nasm -f elf32 kernel.S -o kernel.o
nasm generates an object file, which is NOT suitable for execution because its addresses need to be relocated from the base address 0x0, its sections need to be combined with other sections, external symbols need to be resolved and so on. This is the job of the linker.
When compiling a userspace application, gcc will invoke the linker for you with a default linker script. But for kernel space code you must provide your own linker script that tells where to put the various sections of the code. Our kernel has only a .text section, no stack or heap, and the multiboot header is hardcoded into the .text section. So the linker script is pretty simple:
ENTRY(start) /* the name of the entry label */
SECTIONS {
. = 0x00100000; /* the code should be loaded at 1 MB */
.text ALIGN (0x1000) : /* align at 4 KB */
{
*(.text) /* all text sections from all files */
}
}
I’ve already touched on the linking part in the Restricting program memory article. Basically, we’re saying: “Start our code at 1 MiB and put the .text section at the beginning with 4 KiB alignment. The entry point is start”.
Link like this:
ld -melf_i386 -T link.ld kernel.o -o kernel
And run kernel directly with QEMU:
$ qemu-system-i386 -kernel kernel
You’ve got it:
When the computer is powered up, it starts executing code at its “reset vector”. For modern x86 processors it is 0xFFFFFFF0. At this address the motherboard maps a jump instruction to the BIOS code. The CPU is in “real mode” (16-bit addressing with segmentation (up to 1 MiB), no protection, no paging).
The BIOS does all the usual work: it scans for devices, initializes them and finds a bootable device. After a bootable device is found, it passes control to the bootloader on that device.
The bootloader loads itself from disk (in the multi-stage case), finds the kernel and loads it into memory. In the dark old days every OS had its own format and rules, so there was a variety of incompatible bootloaders. But now there is the Multiboot specification, which gives your kernel some guarantees and amenities in exchange for complying with the specification and providing a Multiboot header.
Dependence on the Multiboot specification is a big deal because it makes life MUCH easier: the bootloader loads your kernel into memory, switches into protected mode, enables the A20 line and can pass useful information like the memory map.
In general, booting a multiboot compliant kernel is simple, especially if it’s in ELF format: you just put a Multiboot header containing the magic number (0x1BADB002), flags and a checksum in the first 8 KiB of the kernel image, aligned on a 4-byte boundary.
In our kernel’s text section we’ve done it:
MAGIC_NUMBER equ 0x1BADB002 ; define the magic number constant
FLAGS equ 0x0 ; multiboot flags
CHECKSUM equ -MAGIC_NUMBER ; calculate the checksum
; (magic number + checksum + flags should equal 0)
section .text: ; start of the text (code) section
align 4 ; the code must be 4 byte aligned
dd MAGIC_NUMBER ; write the magic number to the machine code,
dd FLAGS ; the flags,
dd CHECKSUM ; and the checksum
We didn’t specify any flags because we don’t need anything from the bootloader, like memory maps and stuff, and the bootloader doesn’t need anything from us because we’re in ELF format. For other formats, you must supply the load addresses in the Multiboot header. The Multiboot header itself is pretty simple:
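Since the header figure is an image, here is a sketch of the mandatory part of the Multiboot (v1) header as a C struct; the optional address and video mode fields that follow are only read when the corresponding flags are set.
#include <stdint.h>

struct multiboot_header {
    uint32_t magic;     /* 0x1BADB002 */
    uint32_t flags;     /* what we request from the bootloader */
    uint32_t checksum;  /* chosen so that magic + flags + checksum == 0 */
    /* optional: header_addr, load_addr, ... and video mode fields,
       present only if the corresponding flag bits are set */
};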
Now let’s boot our kernel like serious guys.
First, we create an ISO image with the help of grub2-mkrescue. Create a hierarchy like this:
isodir/
└── boot
├── grub
│ └── grub.cfg
└── kernel
Where grub.cfg is:
menuentry "kernel" {
multiboot /boot/kernel
}
And then invoke grub2-mkrescue:
grub2-mkrescue -o hello-kernel.iso isodir
And now we can boot it in any PC compatible machine:
qemu-system-i386 -cdrom hello-kernel.iso
We’ll see grub2 menu, where we can select our “kernel” and see the red ‘A’ letter.
Isn’t it great?
My brain hurts: all these real/protected mode, A20 line, segmentation, etc. are so quirky. I hope ARM booting is not that complicated. ↩︎
Perf is a facility comprised of kernel infrastructure for gathering various events and a userspace tool to get the gathered data from the kernel and analyze it. It is like gprof, but it is non-invasive, low-overhead and profiles the whole stack, including your app, libraries, system calls AND the kernel with the CPU!
The perf
tool supports a list of measurable events that you can view with
perf list
command. The tool and underlying kernel interface can measure events
coming from different sources. For instance, some events are pure kernel
counters, in this case, they are called software events. Examples include
context-switches, minor-faults, page-faults and others.
Another source of events is the processor itself and its Performance Monitoring Unit (PMU). It provides a list of events to measure micro-architectural events such as the number of cycles, instructions retired, L1 cache misses and so on. Those events are called “PMU hardware events” or “hardware events” for short. They vary with each processor type and model - look at this Vince Weaver’s perf page for details
The “perf_events” interface also provides a small set of common hardware events monikers. On each processor, those events get mapped onto actual events provided by the CPU if they exist, otherwise, the event cannot be used. Somewhat confusingly, these are also called hardware events and hardware cache events.
Finally, there are also tracepoint events which are implemented by the kernel ftrace infrastructure. Those are only available with the 2.6.3x and newer kernels.
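To show what the kernel side of all this looks like, here is a minimal sketch using the raw perf_event_open syscall to count retired instructions for the current process - roughly what perf stat -e instructions does, with error handling kept minimal.
#include <linux/perf_event.h>
#include <sys/syscall.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <string.h>
#include <stdio.h>
#include <stdint.h>

int main(void)
{
    struct perf_event_attr attr;
    memset(&attr, 0, sizeof(attr));
    attr.size = sizeof(attr);
    attr.type = PERF_TYPE_HARDWARE;            /* generic hardware event */
    attr.config = PERF_COUNT_HW_INSTRUCTIONS;  /* the "instructions" moniker */
    attr.disabled = 1;
    attr.exclude_kernel = 1;

    /* pid = 0 (this process), cpu = -1 (any), no group, no flags */
    int fd = syscall(__NR_perf_event_open, &attr, 0, -1, -1, 0);
    if (fd < 0) {
        perror("perf_event_open");
        return 1;
    }

    ioctl(fd, PERF_EVENT_IOC_RESET, 0);
    ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);

    for (volatile int i = 0; i < 1000000; i++)
        ;  /* the work we want to measure */

    ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);

    uint64_t count;
    read(fd, &count, sizeof(count));
    printf("instructions: %llu\n", (unsigned long long)count);
    close(fd);
    return 0;
}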
Thanks to such a variety of events and the analysis abilities of the userspace tool (see below), perf is a big fish in the world of tracing and profiling Linux systems. It is a really versatile tool that may be used in several ways, of which I know a few:
perf record
+ perf report
perf stat
perf top
perf trace
Each of these approaches includes a tremendous amount of possibilities for sorting, filtering, grouping and so on.
But as someone said, perf
is a powerful tool with a little documentation. So
in this article, I’ll try to share some of my knowledge about it.
The first thing to do when you start working with Perf is to launch perf test
.
This will check your system and kernel features and report if something isn’t
available. Usually, you want as many "OK"s as possible. Beware though that perf will behave differently when launched as "root" and as an ordinary user. It's smart enough to let you do some things without root privileges.
There is a control file at “/proc/sys/kernel/perf_event_paranoid” that you can
tweak in order to change access to perf events:
$ perf stat -a
Error:
You may not have permission to collect system-wide stats.
Consider tweaking /proc/sys/kernel/perf_event_paranoid:
-1 - Not paranoid at all
0 - Disallow raw tracepoint access for unpriv
1 - Disallow cpu events for unpriv
2 - Disallow kernel profiling for unpriv
After you've played with perf test, you can see what hardware events are available to you with perf list. Again, the list will differ depending on the current user id. Also, the number of events will depend on your hardware: x86_64 CPUs have many more hardware events than some low-end ARM processors.
Now to some real profiling. To check the general health of your system you can
gather statistics with perf stat
.
# perf stat -a sleep 5
Performance counter stats for 'system wide':
20005.830934 task-clock (msec) # 3.999 CPUs utilized (100.00%)
4,236 context-switches # 0.212 K/sec (100.00%)
160 cpu-migrations # 0.008 K/sec (100.00%)
2,193 page-faults # 0.110 K/sec
2,414,170,118 cycles # 0.121 GHz (83.35%)
4,196,068,507 stalled-cycles-frontend # 173.81% frontend cycles idle (83.34%)
3,735,211,886 stalled-cycles-backend # 154.72% backend cycles idle (66.68%)
2,109,428,612 instructions # 0.87 insns per cycle
# 1.99 stalled cycles per insn (83.34%)
406,168,187 branches # 20.302 M/sec (83.32%)
6,869,950 branch-misses # 1.69% of all branches (83.32%)
5.003164377 seconds time elapsed
Here you can see how many context switches, migrations, page faults and other events happened during 5 seconds, along with some simple calculations. In fact, the perf tool highlights statistics that you should worry about. In my case, it's the stalled-cycles-frontend/backend. These counters show how much of the time the CPU pipeline is stalled (i.e. not advancing) due to some external cause like waiting for memory access.
Along with perf stat you have perf top - a top-like utility that works symbol-wise.
# perf top -a --stdio
PerfTop: 361 irqs/sec kernel:35.5% exact: 0.0% [4000Hz cycles], (all, 4 CPUs)
----------------------------------------------------------------------------------------
2.06% libglib-2.0.so.0.4400.1 [.] g_mutex_lock
1.99% libglib-2.0.so.0.4400.1 [.] g_mutex_unlock
1.47% [kernel] [k] __fget
1.34% libpython2.7.so.1.0 [.] PyEval_EvalFrameEx
1.07% [kernel] [k] copy_user_generic_string
1.00% libpthread-2.21.so [.] pthread_mutex_lock
0.96% libpthread-2.21.so [.] pthread_mutex_unlock
0.85% libc-2.21.so [.] _int_malloc
0.83% libpython2.7.so.1.0 [.] PyParser_AddToken
0.82% [kernel] [k] do_sys_poll
0.81% libQtCore.so.4.8.6 [.] QMetaObject::activate
0.77% [kernel] [k] fput
0.76% [kernel] [k] __audit_syscall_exit
0.75% [kernel] [k] unix_stream_recvmsg
0.63% [kernel] [k] ia32_sysenter_target
Here you can see kernel functions, glib library functions, CPython functions, Qt framework functions and pthread functions, each with its overhead. It's a great tool to peek into the system state and see what's going on.
To profile a particular application, whether already running or not, you use perf record to collect events and then perf report to analyze the program's behavior. Let's see:
# perf record -bag updatedb
[ perf record: Woken up 259 times to write data ]
[ perf record: Captured and wrote 65.351 MB perf.data (127127 samples) ]
Now dive into data with perf report
:
# perf report
You will see a nice interactive TUI interface.
You can zoom into a pid/thread and see what's going on there. You can look at the nicely annotated assembly code (it looks almost like in radare) and run scripts on it to see, for example, a histogram of function calls. If that's not enough for you, there are a lot of options both for perf record and perf report, so play with them.
In addition to that, you can find tools to profile kernel memory subsystem, locking, kvm guests, scheduling, do benchmarking and even create timecharts.
For illustration I’ll profile my simple block_hasher utility. Previously, I’ve profiled it with gprof and gcov, Valgrind and ftrace.
When I was profiling my block_hasher util with gprof and gcov I didn't see anything special related to the application code, so I assumed that it's not the application code that makes it slow. Let's see if perf can help us.
Start with perf stat, giving options for detailed and scaled counters for the whole system ("-dac"):
# perf stat -dac ./block_hasher -d /dev/md0 -b 1048576 -t 10 -n 1000
Performance counter stats for 'system wide':
32978.276562 task-clock (msec) # 4.000 CPUs utilized (100.00%)
6,349 context-switches # 0.193 K/sec (100.00%)
142 cpu-migrations # 0.004 K/sec (100.00%)
2,709 page-faults # 0.082 K/sec
20,998,366,508 cycles # 0.637 GHz (41.08%)
23,007,780,670 stalled-cycles-frontend # 109.57% frontend cycles idle (37.50%)
18,687,140,923 stalled-cycles-backend # 88.99% backend cycles idle (42.64%)
23,466,705,987 instructions # 1.12 insns per cycle
# 0.98 stalled cycles per insn (53.74%)
4,389,207,421 branches # 133.094 M/sec (55.51%)
11,086,505 branch-misses # 0.25% of all branches (55.53%)
7,435,101,164 L1-dcache-loads # 225.455 M/sec (37.50%)
248,499,989 L1-dcache-load-misses # 3.34% of all L1-dcache hits (26.52%)
111,621,984 LLC-loads # 3.385 M/sec (28.77%)
<not supported> LLC-load-misses:HG
8.245518548 seconds time elapsed
Well, nothing really suspicious. 6K context switches is OK because my machine is 2-core and I'm running 10 threads. 2K page faults is fine since we're reading a lot of data from disks. The big stalled-cycles-frontend/backend numbers are outliers here, since they are just as big for a simple ls, and --per-core statistics show 0.00% stalled cycles.
Let’s collect profile:
# perf record -a -g -s -d -b ./block_hasher -d /dev/md0 -b 1048576 -t 10 -n 1000
[ perf record: Woken up 73 times to write data ]
[ perf record: Captured and wrote 20.991 MB perf.data (33653 samples) ]
The options are: -a for system-wide collection, -g to record call graphs, -s for per-thread event counts, -d to record sample addresses and -b to sample branch stacks.
Now show me the profile:
# perf report -g -T
Nothing much. I've looked into block_hasher threads, I've built a histogram, looked at the vmlinux DSO, found the instruction with the most overhead, and still can't say I found what's wrong. That's because there is no real overhead - nothing is spinning in vain. Something is just plain sleeping.
What we've done here and before in the ftrace part is hot-spot analysis, i.e. we tried to find places in our application or system that cause the CPU to spin in useless cycles. Usually that's what you want, but not today. We need to understand why pread is sleeping. And that's what I call "latency profiling".
When you search for perf documentation, the first thing you find is “Perf tutorial”. The “perf tutorial” page is almost entirely dedicated to the “hot spots” scenario, but, fortunately, there is an “Other scenarios” section with “Profiling sleep times” tutorial.
Profiling sleep times
This feature shows where and for how long a program is sleeping or waiting for something.
Whoa, that’s what we need!
Unfortunately, scheduling stats profiling doesn't work by default: perf inject fails with
# perf inject -v -s -i perf.data.raw -o perf.data
registering plugin: /usr/lib64/traceevent/plugins/plugin_kmem.so
registering plugin: /usr/lib64/traceevent/plugins/plugin_mac80211.so
registering plugin: /usr/lib64/traceevent/plugins/plugin_function.so
registering plugin: /usr/lib64/traceevent/plugins/plugin_hrtimer.so
registering plugin: /usr/lib64/traceevent/plugins/plugin_sched_switch.so
registering plugin: /usr/lib64/traceevent/plugins/plugin_jbd2.so
registering plugin: /usr/lib64/traceevent/plugins/plugin_cfg80211.so
registering plugin: /usr/lib64/traceevent/plugins/plugin_scsi.so
registering plugin: /usr/lib64/traceevent/plugins/plugin_xen.so
registering plugin: /usr/lib64/traceevent/plugins/plugin_kvm.so
overriding event (263) sched:sched_switch with new print handler
build id event received for [kernel.kallsyms]:
8adbfad59810c80cb47189726415682e0734788a
failed to write feature 2
The reason is that it can't find the scheduling stats symbols in the build-id cache, because CONFIG_SCHEDSTATS is disabled, since it introduces a "non-trivial performance impact for context switches". Details are in Red Hat bugzilla Bug 1026506 and Bug 1013225. Debian kernels also don't enable this option.
You can recompile the kernel enabling "Collect scheduler statistics" in make menuconfig, but happy Fedora users can just install the debug kernel:
dnf install kernel-debug kernel-debug-devel kernel-debug-debuginfo
Now, when everything works, we can give it a try:
# perf record -e sched:sched_stat_sleep -e sched:sched_switch -e sched:sched_process_exit -g -o perf.data.raw ./block_hasher -d /dev/md0 -b 1048576 -t 10 -n 1000
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.564 MB perf.data.raw (2001 samples) ]
# perf inject -v -s -i perf.data.raw -o perf.data.sched
registering plugin: /usr/lib64/traceevent/plugins/plugin_kmem.so
registering plugin: /usr/lib64/traceevent/plugins/plugin_mac80211.so
registering plugin: /usr/lib64/traceevent/plugins/plugin_function.so
registering plugin: /usr/lib64/traceevent/plugins/plugin_hrtimer.so
registering plugin: /usr/lib64/traceevent/plugins/plugin_sched_switch.so
registering plugin: /usr/lib64/traceevent/plugins/plugin_jbd2.so
registering plugin: /usr/lib64/traceevent/plugins/plugin_cfg80211.so
registering plugin: /usr/lib64/traceevent/plugins/plugin_scsi.so
registering plugin: /usr/lib64/traceevent/plugins/plugin_xen.so
registering plugin: /usr/lib64/traceevent/plugins/plugin_kvm.so
overriding event (266) sched:sched_switch with new print handler
build id event received for /usr/lib/debug/lib/modules/4.1.6-200.fc22.x86_64+debug/vmlinux: c6e34bcb0ab7d65e44644ea2263e89a07744bf85
Using /root/.debug/.build-id/c6/e34bcb0ab7d65e44644ea2263e89a07744bf85 for symbols
But it's really disappointing - I've expanded all call chains only to see nothing:
# perf report --show-total-period -i perf.data.sched
Samples: 10 of event 'sched:sched_switch', Event count (approx.): 31403254575
Children Self Period Command Shared Object Symbol
- 100.00% 0.00% 0 block_hasher libpthread-2.21.so [.] pthread_join
- pthread_join
0
- 100.00% 0.00% 0 block_hasher e34bcb0ab7d65e44644ea2263e89a07744bf85 [k] system_call
system_call
- pthread_join
0
- 100.00% 0.00% 0 block_hasher e34bcb0ab7d65e44644ea2263e89a07744bf85 [k] sys_futex
sys_futex
system_call
- pthread_join
0
- 100.00% 0.00% 0 block_hasher e34bcb0ab7d65e44644ea2263e89a07744bf85 [k] do_futex
do_futex
sys_futex
system_call
- pthread_join
0
- 100.00% 0.00% 0 block_hasher e34bcb0ab7d65e44644ea2263e89a07744bf85 [k] futex_wait
futex_wait
do_futex
sys_futex
system_call
- pthread_join
0
- 100.00% 0.00% 0 block_hasher e34bcb0ab7d65e44644ea2263e89a07744bf85 [k] futex_wait_queue_me
futex_wait_queue_me
futex_wait
do_futex
sys_futex
system_call
- pthread_join
0
- 100.00% 0.00% 0 block_hasher e34bcb0ab7d65e44644ea2263e89a07744bf85 [k] schedule
schedule
futex_wait_queue_me
futex_wait
do_futex
sys_futex
system_call
- pthread_join
0
- 100.00% 100.00% 31403254575 block_hasher e34bcb0ab7d65e44644ea2263e89a07744bf85 [k] __schedule
__schedule
schedule
futex_wait_queue_me
futex_wait
do_futex
sys_futex
system_call
- pthread_join
0
- 14.52% 0.00% 0 block_hasher [unknown] [.] 0000000000000000
0
Let's see what else we can do. There is a perf sched command that has a latency subcommand to "report the per task scheduling latencies and other scheduling properties of the workload". Why not give it a shot?
# perf sched record -o perf.sched -g ./block_hasher -d /dev/md0 -b 1048576 -t 10 -n 1000
[ perf record: Woken up 6 times to write data ]
[ perf record: Captured and wrote 13.998 MB perf.sched (56914 samples) ]
# perf report -i perf.sched
I've inspected the samples for the sched_switch and sched_stat_runtime events (15K and 17K respectively) and found nothing. But then I looked into sched_stat_iowait, and there I found a really suspicious thing:
See? Almost all symbols come from the "kernel.vmlinux" shared object, but one with the strange name "0x000000005f8ccc27" comes from the "dm_delay" object. Wait, what is "dm_delay"? A quick search gives us the answer:
> dm-delay
> ========
>
> Device-Mapper's "delay" target delays reads and/or writes
> and maps them to different devices.
WHAT?! Delays reads and/or writes? Really?
# dmsetup info
Name: delayed
State: ACTIVE
Read Ahead: 256
Tables present: LIVE
Open count: 1
Event number: 0
Major, minor: 253, 0
Number of targets: 1
# dmsetup table
delayed: 0 1000000 delay 1:7 0 30
# udevadm info -rq name /sys/dev/block/1:7
/dev/ram7
So, we have the block device "/dev/ram7" mapped to the device-mapper "delay" target to, well, delay I/O requests by 30 milliseconds. That's why the whole RAID was slow - the performance of RAID0 is the performance of the slowest disk in the RAID.
Of course, I knew it from the beginning. I just wanted to see whether I'd be able to detect it with profiling tools. And in this case, I don't think it's fair to say that perf helped. Actually, perf introduces a lot of confusion in its interface. Look at the picture above. What do those couple dozen lines with "99.67%" tell us? Which of these symbols causes the latency? How do you interpret it? If I weren't really attentive - say, after a couple of hours of debugging and investigating - I wouldn't have been able to notice it. And if I had issued the magic perf inject command, it would have collapsed the sched_stat_iowait samples and I wouldn't have seen dm-delay in the top records.
Again, this is all very confusing and it's sheer luck that I noticed it.
Perf is a really versatile and extremely complex tool with little documentation. In some simple cases it will help you a LOT. But a few steps away from the mainstream problems and you are left alone with unintuitive data. We all need more documentation on perf - tutorials, books, slides, videos - that doesn't just scratch the surface but gives a comprehensive overview of how it works, what it can do and what it can't. I hope I have contributed to that purpose with this article (it took me half a year to write it).
But before I even started to do anything I thought – how can I restrict a process's memory to 1 MiB? Will it work? So, here are the answers.
What you have to know before diving into the various methods is how a process's virtual memory is structured. The best article you could ever find about that is, hands down, Gustavo Duarte's "Anatomy of a Program in Memory". His whole blog is a treasure.
After reading Gustavo’s article I can propose 2 possible options for restricting memory – reduce virtual address space and restrict heap size.
The first is to limit the whole virtual address space of the process. This is nice and easy but not fully correct - we can't limit the whole virtual address space of a process to 1 MiB, because then we won't be able to map the kernel and the libs.
The second is to limit the heap size. This is not so easy, and it seems like nobody tries to do it, because the only reasonable way is to play with the linker. But for limiting available memory to values as small as 1 MiB it is absolutely correct.
Also, I will look at other methods like monitoring memory consumption by intercepting library and system calls related to memory management, and changing the program environment with emulation and sandboxing.
For testing and illustrating I will use this little program big_alloc
that
allocates (and frees) 100 MiB.
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <stdbool.h>
// 1000 allocation per 100 KiB = 100 000 KiB = 100 MiB
#define NALLOCS 1000
#define ALLOC_SIZE 1024*100 // 100 KiB
int main(int argc, const char *argv[])
{
int i = 0;
int **pp;
bool failed = false;
pp = malloc(NALLOCS * sizeof(int *));
for(i = 0; i < NALLOCS; i++)
{
pp[i] = malloc(ALLOC_SIZE);
if (!pp[i])
{
perror("malloc");
printf("Failed after %d allocations\n", i);
failed = true;
break;
}
// Touch some bytes in memory to trick copy-on-write.
memset(pp[i], 0xA, 100);
printf("pp[%d] = %p\n", i, pp[i]);
}
if (!failed)
printf("Successfully allocated %d bytes\n", NALLOCS * ALLOC_SIZE);
for(i = 0; i < NALLOCS; i++)
{
if (pp[i])
free(pp[i]);
}
free(pp);
return 0;
}
All the sources are on github.
It's the first thing an old Unix hacker thinks of when asked to limit a program's memory. ulimit is a bash builtin that allows you to restrict program resources and is just an interface to the setrlimit system call.
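For reference, here is roughly what happens under the hood - a minimal sketch calling setrlimit directly (the limit value is illustrative):
#include <stdio.h>
#include <sys/resource.h>

int main(void)
{
    /* Equivalent of `ulimit -m 1024`: a 1 MiB resident set size limit.
       Like the shell builtin, RLIMIT_RSS is not enforced by modern
       Linux kernels, so this is mostly a no-op. */
    struct rlimit lim = { .rlim_cur = 1024 * 1024, .rlim_max = 1024 * 1024 };

    if (setrlimit(RLIMIT_RSS, &lim) == -1)
        perror("setrlimit");

    return 0;
}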
We can set the limit to resident memory size.
$ ulimit -m 1024
Now check:
$ ulimit -a
core file size (blocks, -c) 0
data seg size (kbytes, -d) unlimited
scheduling priority (-e) 0
file size (blocks, -f) unlimited
pending signals (-i) 7802
max locked memory (kbytes, -l) 64
max memory size (kbytes, -m) 1024
open files (-n) 1024
pipe size (512 bytes, -p) 8
POSIX message queues (bytes, -q) 819200
real-time priority (-r) 0
stack size (kbytes, -s) 8192
cpu time (seconds, -t) unlimited
max user processes (-u) 1024
virtual memory (kbytes, -v) unlimited
file locks (-x) unlimited
We set the memory limit to 1024 kbytes (-m), i.e. 1 MiB. But when we run our program it doesn't fail. Even setting the limit to something more reasonable like 30 MiB still lets our program allocate 100 MB. ulimit simply doesn't work.
Despite setting the resident set size to 1024 kbytes, I can see in top that resident
memory for my program is 4872.
The reason is that Linux doesn't respect this limit, and man ulimit says so directly:
ulimit [-HSTabcdefilmnpqrstuvx [limit]]
...
-m The maximum resident set size (many systems do not honor this limit)
...
There is also ulimit -d, which is respected according to the kernel, but big_alloc still works because of mmap (see the Linker section below).
When you want to modify a program's environment, QEMU is the natural choice for this kind of task. It has a -R option to limit the guest virtual address space. But like I said earlier, you can't restrict the address space to small values – there will be no space to map libc and the kernel.
Look:
$ qemu-i386 -R 1048576 ./big_alloc
big_alloc: error while loading shared libraries: libc.so.6: failed to map segment from shared object: Cannot allocate memory
Here, -R 1048576 reserves 1 MiB for the guest virtual address space. For the whole virtual address space we have to set something more reasonable, like 20 MB. Look:
$ qemu-i386 -R 20M ./big_alloc
malloc: Cannot allocate memory
Failed after 100 allocations
It successfully fails1 after 100 allocations (10 MB). So, QEMU is the first winner in restricting a program's memory size, though you have to play with the -R value to get the correct limit.
Another option after QEMU is to launch the application in a container, restricting its resources. There are several options for doing this, but in the end resources will be restricted with the native Linux subsystem called cgroups. You can try to poke it directly, but I suggest using lxc. I would like to use docker, but it works only on 64-bit machines and my box is a small Intel Atom netbook which is i386.
Ok, quick info. LXC is LinuX Containers. It's a collection of userspace tools and libs for managing kernel facilities to create containers – isolated and secure environments for an application or a whole system. The kernel facilities that provide such an environment are namespaces and cgroups. You can find nice documentation on the official site, on the author's blog and all over the internet.
To simply run an application in the container you have to provide config to
lxc-execute
where you will configure your container. Every sane person should
start from examples in /usr/share/doc/lxc/examples
. Man pages recommend
starting with lxc-macvlan.conf
. Ok, let’s do this:
# cp /usr/share/doc/lxc/examples/lxc-macvlan.conf lxc-my.conf
# lxc-execute -n foo -f ./lxc-my.conf ./big_alloc
Successfully allocated 102400000 bytes
It works!
Now let's limit memory. This is what cgroups are for. LXC allows you to configure the memory subsystem of the container's cgroup by setting memory limits.
You can find the available tunable parameters for the memory subsystem in this fine Red Hat manual. I've found 2:
memory.limit_in_bytes
– sets the maximum amount of user memory (including
file cache)memory.memsw.limit_in_bytes
– sets the maximum amount for the sum of memory
and swap usageHere is what I added to lxc-my.conf:
lxc.cgroup.memory.limit_in_bytes = 2M
lxc.cgroup.memory.memsw.limit_in_bytes = 2M
Launch again:
# lxc-execute -n foo -f ./lxc-my.conf ./big_alloc
#
Nothing happened - looks like the memory limit is way too small. Let's try to launch a shell inside the container.
# lxc-execute -n foo -f ./lxc-my.conf /bin/bash
#
Looks like bash failed to launch. Let’s try /bin/sh
:
# lxc-execute -n foo -f ./lxc-my.conf -l DEBUG -o log /bin/sh
sh-4.2# ./dev/big_alloc/big_alloc
Killed
Yay! We can see this nice act of killing in dmesg
:
[15447.035569] big_alloc invoked oom-killer: gfp_mask=0xd0, order=0, oom_score_adj=0
...
[15447.035779] Task in /lxc/foo
[15447.035785] killed as a result of limit of
[15447.035789] /lxc/foo
[15447.035795] memory: usage 3072kB, limit 3072kB, failcnt 127
[15447.035800] memory+swap: usage 3072kB, limit 3072kB, failcnt 0
[15447.035805] kmem: usage 0kB, limit 18014398509481983kB, failcnt 0
[15447.035808] Memory cgroup stats for /lxc/foo: cache:32KB rss:3040KB rss_huge:0KB mapped_file:0KB writeback:0KB swap:0KB inactive_anon:1588KB active_anon:1448KB inactive_file:16KB active_file:16KB unevictable:0KB
[15447.035836] [ pid ] uid tgid total_vm rss nr_ptes swapents oom_score_adj name
[15447.035963] [ 9225] 0 9225 942 308 10 0 0 init.lxc
[15447.035971] [ 9228] 0 9228 833 698 6 0 0 sh
[15447.035978] [ 9252] 0 9252 16106 843 36 0 0 big_alloc
[15447.035983] Memory cgroup out of memory: Kill process 9252 (big_alloc) score 1110 or sacrifice child
[15447.035990] Killed process 9252 (big_alloc) total-vm:64424kB, anon-rss:2396kB, file-rss:976kB
Though we haven't seen an error message from big_alloc about the malloc failure or how much memory we were able to get, I think we've successfully restricted memory via container technology and can stop here for now.
Now, let's try to modify the binary image, limiting the space available for the heap.
Linking is the final part of building a program and it involves the linker and a linker script. A linker script describes the program sections in memory along with their attributes.
Here is a simple linker script:
ENTRY(main)
SECTIONS
{
. = 0x10000;
.text : { *(.text) }
. = 0x8000000;
.data : { *(.data) }
.bss : { *(.bss) }
}
The dot is the current location counter. What that script tells us is that the .text section starts at address 0x10000, and then starting from 0x8000000 we have 2 subsequent sections - .data and .bss. The entry point is main.
Nice and sweet, but it will not work for any useful application. The reason is that the main function you write in C programs is not actually the first function being called - there is a whole lot of initialization and cleanup code. That code is provided by the C runtime (also shortened to crt) and spread over the crt#.o object files in /usr/lib.
You can see the exact details if you launch gcc with the -v option. You'll see that at first it invokes cc1 and creates assembly, then translates it to an object file with as, and finally combines everything into an ELF file with collect2. That collect2 is an ld wrapper. It takes your object file and 5 additional crt object files to create the final binary image:
/usr/lib/gcc/i686-redhat-linux/4.8.3/../../../crt1.o
/usr/lib/gcc/i686-redhat-linux/4.8.3/../../../crti.o
/usr/lib/gcc/i686-redhat-linux/4.8.3/crtbegin.o
/tmp/ccEZwSgF.o    <-- This one is our program object file
/usr/lib/gcc/i686-redhat-linux/4.8.3/crtend.o
/usr/lib/gcc/i686-redhat-linux/4.8.3/../../../crtn.o
It's really complicated, so instead of writing my own script I'll modify the default linker script. Get the default linker script by passing -Wl,-verbose to gcc:
gcc big_alloc.c -o big_alloc -Wl,-verbose
Now let's figure out how to modify it. Let's see how our binary is built by default. Compile it and look at the .data section address. Here is the objdump -h big_alloc output:
Sections:
Idx Name Size VMA LMA File off Algn
...
12 .text 000002e4 080483e0 080483e0 000003e0 2**4
CONTENTS, ALLOC, LOAD, READONLY, CODE
...
23 .data 00000004 0804a028 0804a028 00001028 2**2
CONTENTS, ALLOC, LOAD, DATA
24 .bss 00000004 0804a02c 0804a02c 0000102c 2**2
ALLOC
The .text, .data and .bss sections are located near 128 MiB.
Now, let's see where the stack is, with the help of gdb:
[restrict-memory]$ gdb big_alloc
...
Reading symbols from big_alloc...done.
(gdb) break main
Breakpoint 1 at 0x80484fa: file big_alloc.c, line 12.
(gdb) r
Starting program: /home/avd/dev/restrict-memory/big_alloc
Breakpoint 1, main (argc=1, argv=0xbffff164) at big_alloc.c:12
12 int i = 0;
Missing separate debuginfos, use: debuginfo-install glibc-2.18-16.fc20.i686
(gdb) info registers
eax 0x1 1
ecx 0x9a8fc98f -1701852785
edx 0xbffff0f4 -1073745676
ebx 0x42427000 1111650304
esp 0xbffff0a0 0xbffff0a0
ebp 0xbffff0c8 0xbffff0c8
esi 0x0 0
edi 0x0 0
eip 0x80484fa 0x80484fa <main+10>
eflags 0x286 [ PF SF IF ]
cs 0x73 115
ss 0x7b 123
ds 0x7b 123
es 0x7b 123
fs 0x0 0
gs 0x33 51
esp
points to 0xbffff0a0
which is near 3 GiB. So we have ~2.9 GiB for heap.
In the real world, stack top address is randomized, e.g. you can see it in the output of
# cat /proc/self/maps
As we all know, the heap grows up from the end of .data towards the stack. What if we move the .data section to the highest possible address?
Let's put the data segment 2 MiB below the stack. Take the stack top and subtract 2 MiB:
0xbffff0a0 - 0x200000 = 0xbfdff0a0
Now shift all sections starting with .data
to that address:
. = 0xbfdff0a0;
.data :
{
*(.data .data.* .gnu.linkonce.d.*)
SORT(CONSTRUCTORS)
}
Compile it:
$ gcc big_alloc.c -o big_alloc -Wl,-T hack.lst
-Wl passes options to the linker, and -T hack.lst is the linker option itself - it tells the linker to use hack.lst as the linker script.
Now, if we look at header we’ll see that:
Sections:
Idx Name Size VMA LMA File off Algn
...
23 .data 00000004 bfdff0a0 bfdff0a0 000010a0 2**2
CONTENTS, ALLOC, LOAD, DATA
24 .bss 00000004 bfdff0a4 bfdff0a4 000010a4 2**2
ALLOC
So the heap now has only about 2 MiB of room before it runs into the stack. But nevertheless, the program still successfully allocates all its memory. How? That's really neat. When I looked at the pointer values that malloc returns I saw that allocation starts somewhere above the end of the .data section, like 0xbf8b7000, continues for some time with increasing pointers and then resets to a lower address like 0xb7676000. From that address it allocates for some time with pointers increasing and then resets again to an even lower address like 0xb5e76000. Eventually, it looks like the heap is growing down!
But if you think about it for a minute, it isn't really that strange. I've examined some glibc sources and found out that when brk fails, glibc will use mmap instead. So glibc asks the kernel to map some pages, the kernel sees that the process has lots of holes in its virtual address space and maps pages from that space for glibc, and finally glibc returns a pointer from one of those pages.
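In other words, what glibc effectively does when brk stops working is something like this (a simplified sketch, not the actual glibc code; the size matches the 1 MiB mmap2 calls in the strace output below):
#include <stddef.h>
#include <sys/mman.h>

/* Roughly what glibc falls back to when brk can't grow the heap anymore:
   ask the kernel for a fresh anonymous mapping and hand out pointers
   from it. */
static void *alloc_fallback(size_t size)
{
    void *p = mmap(NULL, size, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    return (p == MAP_FAILED) ? NULL : p;
}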
Running big_alloc under strace confirmed the theory. Just look at the normal binary:
brk(0) = 0x8135000
mmap2(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xb77df000
mmap2(NULL, 95800, PROT_READ, MAP_PRIVATE, 3, 0) = 0xb77c7000
mmap2(0x4226d000, 1825436, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x4226d000
mmap2(0x42425000, 12288, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x1b8000) = 0x42425000
mmap2(0x42428000, 10908, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x42428000
mmap2(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xb77c6000
mprotect(0x42425000, 8192, PROT_READ) = 0
mprotect(0x8049000, 4096, PROT_READ) = 0
mprotect(0x42269000, 4096, PROT_READ) = 0
munmap(0xb77c7000, 95800) = 0
brk(0) = 0x8135000
brk(0x8156000) = 0x8156000
brk(0) = 0x8156000
mmap2(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xb77de000
brk(0) = 0x8156000
brk(0x8188000) = 0x8188000
brk(0) = 0x8188000
brk(0x81ba000) = 0x81ba000
brk(0) = 0x81ba000
brk(0x81ec000) = 0x81ec000
...
brk(0) = 0x9c19000
brk(0x9c4b000) = 0x9c4b000
brk(0) = 0x9c4b000
brk(0x9c7d000) = 0x9c7d000
brk(0) = 0x9c7d000
brk(0x9caf000) = 0x9caf000
...
brk(0) = 0xe29c000
brk(0xe2ce000) = 0xe2ce000
brk(0) = 0xe2ce000
brk(0xe300000) = 0xe300000
brk(0) = 0xe300000
brk(0) = 0xe300000
brk(0x8156000) = 0x8156000
brk(0) = 0x8156000
+++ exited with 0 +++
and now the modified binary
brk(0) = 0xbf896000
mmap2(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xb778f000
mmap2(NULL, 95800, PROT_READ, MAP_PRIVATE, 3, 0) = 0xb7777000
mmap2(0x4226d000, 1825436, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x4226d000
mmap2(0x42425000, 12288, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x1b8000) = 0x42425000
mmap2(0x42428000, 10908, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x42428000
mmap2(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xb7776000
mprotect(0x42425000, 8192, PROT_READ) = 0
mprotect(0x8049000, 4096, PROT_READ) = 0
mprotect(0x42269000, 4096, PROT_READ) = 0
munmap(0xb7777000, 95800) = 0
brk(0) = 0xbf896000
brk(0xbf8b7000) = 0xbf8b7000
brk(0) = 0xbf8b7000
mmap2(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xb778e000
brk(0) = 0xbf8b7000
brk(0xbf8e9000) = 0xbf8e9000
brk(0) = 0xbf8e9000
brk(0xbf91b000) = 0xbf91b000
brk(0) = 0xbf91b000
brk(0xbf94d000) = 0xbf94d000
brk(0) = 0xbf94d000
brk(0xbf97f000) = 0xbf97f000
...
brk(0) = 0xbff8e000
brk(0xbffc0000) = 0xbffc0000
brk(0) = 0xbffc0000
brk(0xbfff2000) = 0xbffc0000
mmap2(NULL, 1048576, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xb7676000
brk(0) = 0xbffc0000
brk(0xbfffa000) = 0xbffc0000
mmap2(NULL, 1048576, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xb7576000
brk(0) = 0xbffc0000
brk(0xbfffa000) = 0xbffc0000
mmap2(NULL, 1048576, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xb7476000
brk(0) = 0xbffc0000
brk(0xbfffa000) = 0xbffc0000
mmap2(NULL, 1048576, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xb7376000
...
brk(0) = 0xbffc0000
brk(0xbfffa000) = 0xbffc0000
mmap2(NULL, 1048576, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xb1c76000
brk(0) = 0xbffc0000
brk(0xbfffa000) = 0xbffc0000
mmap2(NULL, 1048576, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xb1b76000
brk(0) = 0xbffc0000
brk(0xbfffa000) = 0xbffc0000
mmap2(NULL, 1048576, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xb1a76000
brk(0) = 0xbffc0000
brk(0) = 0xbffc0000
brk(0) = 0xbffc0000
...
brk(0) = 0xbffc0000
brk(0) = 0xbffc0000
brk(0) = 0xbffc0000
+++ exited with 0 +++
That being said, shifting the .data section up towards the stack (thus reducing space for the heap) is pointless, because the kernel will map pages for malloc from empty areas of the virtual address space.
The other way to restrict program memory is sandboxing. The difference from emulation is that we're not really emulating anything; instead, we track and control certain aspects of the program's behavior. Usually sandboxing is used for security research, when you have some kind of malware and need to analyze it without harming your system.
I've come up with several sandboxing methods and implemented the most promising ones.
LD_PRELOAD is a special environment variable that, when set, makes the dynamic linker use the "preloaded" library before any other library, including libc. It's used in a lot of scenarios, from debugging to, well, sandboxing. This trick is also infamously used by some malware.
I have written a simple memory management sandbox that intercepts malloc/free calls, does memory usage accounting and returns ENOMEM if the memory limit is exceeded.
To do this I have written a shared library with my own malloc/free wrappers that increment a counter by the malloc size and decrement it when free is called. This library is preloaded with LD_PRELOAD when running the application under test.
Here is my malloc implementation.
void *malloc(size_t size)
{
void *p = NULL;
if (libc_malloc == NULL)
save_libc_malloc();
if (mem_allocated <= MEM_THRESHOLD)
{
p = libc_malloc(size);
}
else
{
errno = ENOMEM;
return NULL;
}
if (!no_hook)
{
no_hook = 1;
account(p, size);
no_hook = 0;
}
return p;
}
libc_malloc is a pointer to the original malloc from libc. no_hook is a thread-local flag. It is used to be able to call malloc inside the malloc hooks without recursing - an idea taken from a Tetsuyuki Kobayashi presentation.
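save_libc_malloc itself isn't shown above; it boils down to the usual dlsym(RTLD_NEXT) trick - roughly this sketch (the real code may differ):
#define _GNU_SOURCE
#include <dlfcn.h>
#include <stdlib.h>

static void *(*libc_malloc)(size_t) = NULL;

/* Look up the next "malloc" in the dynamic linker's search order,
   i.e. the real one from libc, and remember it. */
static void save_libc_malloc(void)
{
    libc_malloc = (void *(*)(size_t))dlsym(RTLD_NEXT, "malloc");
    if (libc_malloc == NULL)
        abort();
}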
malloc is used implicitly in the account function by the uthash hash table library. Why use a hash table? Because when you call free you pass it only the pointer, so inside free you don't know how much memory had been allocated. So I have a hash table with the pointer as the key and the allocated size as the value. Here is what I do on malloc:
struct malloc_item *item, *out;
item = malloc(sizeof(*item));
item->p = ptr;
item->size = size;
HASH_ADD_PTR(HT, p, item);
mem_allocated += size;
fprintf(stderr, "Alloc: %p -> %zu\n", ptr, size);
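The malloc_item structure used here would look roughly like this (a sketch following uthash conventions; the field names match the snippet above):
#include <stdlib.h>
#include "uthash.h"

struct malloc_item {
    void *p;            /* key: the pointer returned by malloc */
    size_t size;        /* value: how many bytes were allocated */
    UT_hash_handle hh;  /* makes this structure hashable by uthash */
};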
mem_allocated is the static variable that is compared against the threshold in malloc. Now when free is called, here is what happens:
struct malloc_item *found;
HASH_FIND_PTR(HT, &ptr, found);
if (found)
{
mem_allocated -= found->size;
fprintf(stderr, "Free: %p -> %zu\n", found->p, found->size);
HASH_DEL(HT, found);
free(found);
}
else
{
fprintf(stderr, "Freeing unaccounted allocation %p\n", ptr);
}
Yep, just decrement mem_allocated. It's that simple.
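The exported free wrapper around that accounting code would look roughly like this (a sketch; libc_free, save_libc_free and unaccount are hypothetical names that mirror libc_malloc, save_libc_malloc and account, and no_hook is the same thread-local flag used in the malloc wrapper above):
static void (*libc_free)(void *) = NULL;   /* saved via dlsym, like libc_malloc */

void free(void *ptr)
{
    if (libc_free == NULL)
        save_libc_free();     /* hypothetical, mirrors save_libc_malloc */

    if (!no_hook)
    {
        no_hook = 1;
        unaccount(ptr);       /* the HASH_FIND_PTR/HASH_DEL code above */
        no_hook = 0;
    }

    libc_free(ptr);
}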
But the really cool thing is that it works rock solid2.
[restrict-memory]$ LD_PRELOAD=./libmemrestrict.so ./big_alloc
pp[0] = 0x25ac210
pp[1] = 0x25c5270
pp[2] = 0x25de2d0
pp[3] = 0x25f7330
pp[4] = 0x2610390
pp[5] = 0x26293f0
pp[6] = 0x2642450
pp[7] = 0x265b4b0
pp[8] = 0x2674510
pp[9] = 0x268d570
pp[10] = 0x26a65d0
pp[11] = 0x26bf630
pp[12] = 0x26d8690
pp[13] = 0x26f16f0
pp[14] = 0x270a750
pp[15] = 0x27237b0
pp[16] = 0x273c810
pp[17] = 0x2755870
pp[18] = 0x276e8d0
pp[19] = 0x2787930
pp[20] = 0x27a0990
malloc: Cannot allocate memory
Failed after 21 allocations
The full source code for the library is on github.
So, LD_PRELOAD is a great way to restrict memory!
ptrace is another feature that can be used to build a memory sandbox. ptrace is a system call that allows you to control the execution of another process. It's built into various POSIX operating systems including, of course, Linux. ptrace is the foundation of tracers like strace and ltrace, of almost all sandboxing software like systrace, sydbox and mbox, and of all debuggers including gdb itself.
I have built a custom tool with ptrace. It traces brk calls and looks at the distance between the initial program break value and the new value set by the next brk call.
This tool forks and becomes 2 processes. The parent process is the tracer and the child process is the tracee. In the child process I call ptrace(PTRACE_TRACEME) and then execv. In the parent I use ptrace(PTRACE_SYSCALL) to stop on syscall entry and filter out brk calls from the child, and then another ptrace(PTRACE_SYSCALL) to get the brk return value.
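The fork/exec plumbing around that logic looks roughly like this (a simplified sketch; the real tool also decodes which syscall is being made before touching the registers, as in the snippet below):
#include <stdio.h>
#include <sys/ptrace.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

int main(int argc, char *argv[])
{
    if (argc < 2)
        return 1;

    pid_t pid = fork();
    if (pid == 0) {
        /* Child: ask to be traced and exec the program under test. */
        ptrace(PTRACE_TRACEME, 0, NULL, NULL);
        execv(argv[1], &argv[1]);
        perror("execv");
        return 1;
    }

    int status;
    waitpid(pid, &status, 0);            /* stops right after execv */
    while (!WIFEXITED(status)) {
        /* Resume the child until the next syscall entry or exit. */
        ptrace(PTRACE_SYSCALL, pid, NULL, NULL);
        waitpid(pid, &status, 0);
        /* ...here the real tool inspects and patches registers... */
    }
    return 0;
}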
When brk exceeds the threshold I set -ENOMEM as the brk return value. It is returned in the eax register, so I just overwrite it with ptrace(PTRACE_SETREGS). Here is the meaty part:
// Get return value
if (!syscall_trace(pid, &state))
{
dbg("brk return: 0x%08X, brk_start 0x%08X\n", state.eax, brk_start);
if (brk_start) // We have start of brk
{
diff = state.eax - brk_start;
// If child process exceeded threshold
// replace brk return value with -ENOMEM
if (diff > THRESHOLD || threshold)
{
dbg("THRESHOLD!\n");
threshold = true;
state.eax = -ENOMEM;
ptrace(PTRACE_SETREGS, pid, 0, &state);
}
else
{
dbg("diff 0x%08X\n", diff);
}
}
else
{
dbg("Assigning 0x%08X to brk_start\n", state.eax);
brk_start = state.eax;
}
}
Also, I intercept mmap/mmap2 calls because libc is smart enough to call them when brk fails. So when the threshold is exceeded and I see mmap calls, I just fail them with ENOMEM.
It works!
[restrict-memory]$ ./ptrace-restrict ./big_alloc
pp[0] = 0x8958fb0
pp[1] = 0x8971fb8
pp[2] = 0x898afc0
pp[3] = 0x89a3fc8
pp[4] = 0x89bcfd0
pp[5] = 0x89d5fd8
pp[6] = 0x89eefe0
pp[7] = 0x8a07fe8
pp[8] = 0x8a20ff0
pp[9] = 0x8a39ff8
pp[10] = 0x8a53000
pp[11] = 0x8a6c008
pp[12] = 0x8a85010
pp[13] = 0x8a9e018
pp[14] = 0x8ab7020
pp[15] = 0x8ad0028
pp[16] = 0x8ae9030
pp[17] = 0x8b02038
pp[18] = 0x8b1b040
pp[19] = 0x8b34048
pp[20] = 0x8b4d050
malloc: Cannot allocate memory
Failed after 21 allocations
But… I don't really like it. It's ABI specific, i.e. it has to use rax instead of eax on a 64-bit machine, so I would either have to make a different version of the tool, use #ifdef to cope with the ABI differences, or make you build it with the -m32 option. That's not usable. Also, it probably won't work on other POSIX-like systems, because they might have a different ABI.
There are also other things one may try, which I rejected for different reasons: prctl with PR_SET_MM_START_BRK. This might work, but as said in the seccomp filtering kernel documentation it's not sandboxing but a "mechanism for minimizing the exposed kernel surface", so I guess it will be even more awkward than using ptrace by hand. Though I might look at it sometime.
In the end, I'd like to recap:
ulimit doesn't work.
cgroups kinda work - by crashing the application.
LD_PRELOAD works amazingly!
ptrace works well enough but is ABI dependent.
The linker script trick doesn't work because malloc falls back to mmap.
Ftrace is a framework for tracing and profiling the Linux kernel.
Essentially, ftrace is built around a smart lockless ring buffer implementation (see Documentation/trace/ring-buffer-design.txt). That buffer stores all the ftrace data and is exported via debugfs1 in /sys/kernel/debug/tracing/. All manipulations are done with simple file operations in this directory.
As I've just said, ftrace is a framework, meaning that it provides only the ring buffer - all the real work is done by so-called tracers. Currently, ftrace includes several tracers, most notably function and function_graph, plus latency tracers like irqsoff and wakeup. On top of that you get additional features like event tracing and dynamic probes, covered below.
Now let's look at specific tracers.
The main ftrace function is, well, function tracing (the function and function_graph tracers). To achieve this, kernel functions are instrumented with mcount calls, just like with gprof. But the kernel mcount, of course, totally differs from the userspace one, because it's architecture dependent. This dependency is required to be able to build call graphs and, more specifically, to get the caller address from the previous stack frame.
This mcount call is inserted in the function prologue, and if tracing is turned off it does nothing. But if it's turned on, it calls an ftrace function that, depending on the current tracer, writes different data to the ring buffer.
Event tracing is done with the help of tracepoints. You set an event via the set_event file in /sys/kernel/debug/tracing and then it will be traced into the ring buffer. For example, to trace kmalloc, just issue
echo kmalloc > /sys/kernel/debug/tracing/set_event
and now you can see it in trace:
tail-7747 [000] .... 12584.876544: kmalloc: call_site=c06c56da ptr=e9cf9eb0 bytes_req=4 bytes_alloc=8 gfp_flags=GFP_KERNEL|GFP_ZERO
and it’s the same as in include/trace/events/kmem.h
, meaning it’s just a
tracepoint.
In kernel 3.10, support for kprobes and kretprobes was added to ftrace. Now you can do dynamic tracing without writing your own kernel module. But, unfortunately, there is not much you can do with it - just fetch function arguments, register values and return values. And again, this output is written to the ring buffer, and you can calculate some statistics over it.
Let's trace something that doesn't have a tracepoint - something not from the kernel itself but from a kernel module.
On my Samsung N210 laptop I have the ath9k WiFi module, which most likely doesn't have any tracepoints. To check this, just grep available_events:
[tracing]# grep ath available_events
cfg80211:rdev_del_mpath
cfg80211:rdev_add_mpath
cfg80211:rdev_change_mpath
cfg80211:rdev_get_mpath
cfg80211:rdev_dump_mpath
cfg80211:rdev_return_int_mpath_info
ext4:ext4_ext_convert_to_initialized_fastpath
Let's see what functions we can put a probe on:
[tracing]# grep "\[ath9k\]" /proc/kallsyms | grep ' t ' | grep rx
f82e6ed0 t ath_rx_remove_buffer [ath9k]
f82e6f60 t ath_rx_buf_link.isra.25 [ath9k]
f82e6ff0 t ath_get_next_rx_buf [ath9k]
f82e7130 t ath_rx_edma_buf_link [ath9k]
f82e7200 t ath_rx_addbuffer_edma [ath9k]
f82e7250 t ath_rx_edma_cleanup [ath9k]
f82f3720 t ath_debug_stat_rx [ath9k]
f82e7a70 t ath_rx_tasklet [ath9k]
f82e7310 t ath_rx_cleanup [ath9k]
f82e7800 t ath_calcrxfilter [ath9k]
f82e73e0 t ath_rx_init [ath9k]
(First grep filters symbols from ath9k module, second grep filters functions which reside in text section and last grep filters receiver functions).
For example, we will trace ath_get_next_rx_buf
function.
[tracing]# echo 'r:ath_probe ath9k:ath_get_next_rx_buf $retval' >> kprobe_events
This command is not from the top of my head – check Documentation/trace/kprobetrace.txt.
This puts a retprobe on our function and fetches the return value (it's just a pointer).
After we’ve put probe we must enable it:
[tracing]# echo 1 > events/kprobes/enable
And then we can look for output in trace
file and here it is:
midori-6741 [000] d.s. 3011.304724: ath_probe: (ath_rx_tasklet+0x35a/0xc30 [ath9k] <- ath_get_next_rx_buf) arg1=0xf6ae39f4
By default, ftrace collects info about all kernel functions, and that's huge. But, being a sophisticated kernel mechanism, ftrace has a lot of features, many kinds of options, tunable params and so on, which I don't feel like covering here because there are plenty of manuals and articles on LWN (see the To read section). Hence, it's no wonder that we can, for example, filter by PID. Here is the script:
#!/bin/sh
DEBUGFS=`grep debugfs /proc/mounts | awk '{ print $2; }'`
# Reset trace stat
echo 0 > $DEBUGFS/tracing/function_profile_enabled
echo 1 > $DEBUGFS/tracing/function_profile_enabled
echo $$ > $DEBUGFS/tracing/set_ftrace_pid
echo function > $DEBUGFS/tracing/current_tracer
exec $*
function_profile_enabled enables collecting statistical info. Launch our magic script
./ftrace-me ./block_hasher -d /dev/md127 -b 1048576 -t10 -n10000
get per-processor statistics from the files in tracing/trace_stat/
head -n50 tracing/trace_stat/function* > ~/trace_stat
and see the top 5:
==> function0 <==
Function Hit Time Avg
-------- --- ---- ---
schedule 444425 8653900277 us 19472.12 us
schedule_timeout 36019 813403521 us 22582.62 us
do_IRQ 8161576 796860573 us 97.635 us
do_softirq 486268 791706643 us 1628.128 us
__do_softirq 486251 790968923 us 1626.667 us
==> function1 <==
Function Hit Time Avg
-------- --- ---- ---
schedule 1352233 13378644495 us 9893.742 us
schedule_hrtimeout_range 11853 2708879282 us 228539.5 us
poll_schedule_timeout 7733 2366753802 us 306058.9 us
schedule_timeout 176343 1857637026 us 10534.22 us
schedule_timeout_interruptible 95 1637633935 us 17238251 us
==> function2 <==
Function Hit Time Avg
-------- --- ---- ---
schedule 1260239 9324003483 us 7398.599 us
vfs_read 215859 884716012 us 4098.582 us
do_sync_read 214950 851281498 us 3960.369 us
sys_pread64 13136 830103896 us 63193.04 us
generic_file_aio_read 14955 830034649 us 55502.14 us
(Don't pay attention to schedule – those are just calls into the scheduler.)
Most of the time we are spending in schedule, do_IRQ, schedule_hrtimeout_range and vfs_read, meaning that we are either waiting for a read or waiting for some timeout. Now that's strange! To make it clearer we can disable the so-called graph time so that child functions wouldn't be counted. Let me explain: by default ftrace counts function time as the time of the function itself plus all subroutine calls. That's the graph_time option in ftrace. Tell it:
echo 0 > options/graph_time
And collect profile again
==> function0 <==
Function Hit Time Avg
-------- --- ---- ---
schedule 34129 6762529800 us 198146.1 us
mwait_idle 50428 235821243 us 4676.394 us
mempool_free 59292718 27764202 us 0.468 us
mempool_free_slab 59292717 26628794 us 0.449 us
bio_endio 49761249 24374630 us 0.489 us
==> function1 <==
Function Hit Time Avg
-------- --- ---- ---
schedule 958708 9075670846 us 9466.564 us
mwait_idle 406700 391923605 us 963.667 us
_spin_lock_irq 22164884 15064205 us 0.679 us
__make_request 3890969 14825567 us 3.810 us
get_page_from_freelist 7165243 14063386 us 1.962 us
Now we see the amusing mwait_idle that somebody is somehow calling. We can't say how it happens. Maybe we should get a function graph! We know that it all starts with pread, so let's try to trace down the function calls from pread.
By that moment, I had gotten tired of reading and writing debugfs files and started to use the CLI interface to ftrace, which is trace-cmd.
Using trace-cmd is dead simple – first, you record with trace-cmd record and then analyze it with trace-cmd report.
Record:
trace-cmd record -p function_graph -o graph_pread.dat -g sys_pread64 \
./block_hasher -d /dev/md127 -b 1048576 -t10 -n100
Look:
trace-cmd report -i graph_pread.dat | less
And it’s disappointing.
block_hasher-4102 [001] 2764.516562: funcgraph_entry: | __page_cache_alloc() {
block_hasher-4102 [001] 2764.516562: funcgraph_entry: | alloc_pages_current() {
block_hasher-4102 [001] 2764.516562: funcgraph_entry: 0.052 us | policy_nodemask();
block_hasher-4102 [001] 2764.516563: funcgraph_entry: 0.058 us | policy_zonelist();
block_hasher-4102 [001] 2764.516563: funcgraph_entry: | __alloc_pages_nodemask() {
block_hasher-4102 [001] 2764.516564: funcgraph_entry: 0.054 us | _cond_resched();
block_hasher-4102 [001] 2764.516564: funcgraph_entry: 0.063 us | next_zones_zonelist();
block_hasher-4109 [000] 2764.516564: funcgraph_entry: | SyS_pread64() {
block_hasher-4102 [001] 2764.516564: funcgraph_entry: | get_page_from_freelist() {
block_hasher-4109 [000] 2764.516564: funcgraph_entry: | __fdget() {
block_hasher-4102 [001] 2764.516565: funcgraph_entry: 0.052 us | next_zones_zonelist();
block_hasher-4109 [000] 2764.516565: funcgraph_entry: | __fget_light() {
block_hasher-4109 [000] 2764.516565: funcgraph_entry: 0.217 us | __fget();
block_hasher-4102 [001] 2764.516565: funcgraph_entry: 0.046 us | __zone_watermark_ok();
block_hasher-4102 [001] 2764.516566: funcgraph_entry: 0.057 us | __mod_zone_page_state();
block_hasher-4109 [000] 2764.516566: funcgraph_exit: 0.745 us | }
block_hasher-4109 [000] 2764.516566: funcgraph_exit: 1.229 us | }
block_hasher-4102 [001] 2764.516566: funcgraph_entry: | zone_statistics() {
block_hasher-4109 [000] 2764.516566: funcgraph_entry: | vfs_read() {
block_hasher-4102 [001] 2764.516566: funcgraph_entry: 0.064 us | __inc_zone_state();
block_hasher-4109 [000] 2764.516566: funcgraph_entry: | rw_verify_area() {
block_hasher-4109 [000] 2764.516567: funcgraph_entry: | security_file_permission() {
block_hasher-4102 [001] 2764.516567: funcgraph_entry: 0.057 us | __inc_zone_state();
block_hasher-4109 [000] 2764.516567: funcgraph_entry: 0.048 us | cap_file_permission();
block_hasher-4102 [001] 2764.516567: funcgraph_exit: 0.907 us | }
block_hasher-4102 [001] 2764.516567: funcgraph_entry: 0.056 us | bad_range();
block_hasher-4109 [000] 2764.516567: funcgraph_entry: 0.115 us | __fsnotify_parent();
block_hasher-4109 [000] 2764.516568: funcgraph_entry: 0.159 us | fsnotify();
block_hasher-4102 [001] 2764.516568: funcgraph_entry: | mem_cgroup_bad_page_check() {
block_hasher-4102 [001] 2764.516568: funcgraph_entry: | lookup_page_cgroup_used() {
block_hasher-4102 [001] 2764.516568: funcgraph_entry: 0.052 us | lookup_page_cgroup();
block_hasher-4109 [000] 2764.516569: funcgraph_exit: 1.958 us | }
block_hasher-4102 [001] 2764.516569: funcgraph_exit: 0.435 us | }
block_hasher-4109 [000] 2764.516569: funcgraph_exit: 2.487 us | }
block_hasher-4102 [001] 2764.516569: funcgraph_exit: 0.813 us | }
block_hasher-4102 [001] 2764.516569: funcgraph_exit: 4.666 us | }
First of all, there is no straight function call chain - it's constantly interrupted and transferred to another CPU. Secondly, there is a lot of noise, e.g. the inc_zone_state and __page_cache_alloc calls. And finally, there are neither mdraid functions nor mwait_idle calls!
The reasons are ftrace's default sources (tracepoints) and the async/callback nature of kernel code. You won't see a direct function call chain from sys_pread64 - the kernel doesn't work this way.
But what if we set up kprobes on mdraid functions? No problem! Just add return probes for mwait_idle and md_make_request:
# echo 'r:md_make_request_probe md_make_request $retval' >> kprobe_events
# echo 'r:mwait_probe mwait_idle $retval' >> kprobe_events
Repeat the routine with trace-cmd
to get function graph:
# trace-cmd record -p function_graph -o graph_md.dat -g md_make_request -e md_make_request_probe -e mwait_probe -F \
./block_hasher -d /dev/md0 -b 1048576 -t10 -n100
-e
enables particular event. Now, look at function graph:
block_hasher-28990 [000] 10235.125319: funcgraph_entry: | md_make_request() {
block_hasher-28990 [000] 10235.125321: funcgraph_entry: | make_request() {
block_hasher-28990 [000] 10235.125322: funcgraph_entry: 0.367 us | md_write_start();
block_hasher-28990 [000] 10235.125323: funcgraph_entry: | bio_clone_mddev() {
block_hasher-28990 [000] 10235.125323: funcgraph_entry: | bio_alloc_bioset() {
block_hasher-28990 [000] 10235.125323: funcgraph_entry: | mempool_alloc() {
block_hasher-28990 [000] 10235.125323: funcgraph_entry: 0.178 us | _cond_resched();
block_hasher-28990 [000] 10235.125324: funcgraph_entry: | mempool_alloc_slab() {
block_hasher-28990 [000] 10235.125324: funcgraph_entry: | kmem_cache_alloc() {
block_hasher-28990 [000] 10235.125324: funcgraph_entry: | cache_alloc_refill() {
block_hasher-28990 [000] 10235.125325: funcgraph_entry: 0.275 us | _spin_lock();
block_hasher-28990 [000] 10235.125326: funcgraph_exit: 1.072 us | }
block_hasher-28990 [000] 10235.125326: funcgraph_exit: 1.721 us | }
block_hasher-28990 [000] 10235.125326: funcgraph_exit: 2.085 us | }
block_hasher-28990 [000] 10235.125326: funcgraph_exit: 2.865 us | }
block_hasher-28990 [000] 10235.125326: funcgraph_entry: 0.187 us | bio_init();
block_hasher-28990 [000] 10235.125327: funcgraph_exit: 3.665 us | }
block_hasher-28990 [000] 10235.125327: funcgraph_entry: 0.229 us | __bio_clone();
block_hasher-28990 [000] 10235.125327: funcgraph_exit: 4.584 us | }
block_hasher-28990 [000] 10235.125328: funcgraph_entry: 1.093 us | raid5_compute_sector();
block_hasher-28990 [000] 10235.125330: funcgraph_entry: | blk_recount_segments() {
block_hasher-28990 [000] 10235.125330: funcgraph_entry: 0.340 us | __blk_recalc_rq_segments();
block_hasher-28990 [000] 10235.125331: funcgraph_exit: 0.769 us | }
block_hasher-28990 [000] 10235.125331: funcgraph_entry: 0.202 us | _spin_lock_irq();
block_hasher-28990 [000] 10235.125331: funcgraph_entry: 0.194 us | generic_make_request();
block_hasher-28990 [000] 10235.125332: funcgraph_exit: + 10.613 us | }
block_hasher-28990 [000] 10235.125332: funcgraph_exit: + 13.638 us | }
Much better! But for some reason it doesn't have mwait_idle calls, and it just calls generic_make_request. I've tried to record a function graph for generic_make_request (the -g option). Still no luck. I've extracted all functions containing "wait" and here is the result:
# grep 'wait' graph_md.graph | cut -f 2 -d'|' | awk '{print $1}' | sort -n | uniq -c
18 add_wait_queue()
2064 bit_waitqueue()
1 bit_waitqueue();
1194 finish_wait()
28 page_waitqueue()
2033 page_waitqueue();
1222 prepare_to_wait()
25 remove_wait_queue()
4 update_stats_wait_end()
213 update_stats_wait_end();
(cut separates the function names, awk prints only the function names, and uniq with sort reduces them to unique names with counts.)
Nothing related to timeouts. I’ve tried to grep for timeout and, damn, nothing suspicious.
So, right now I’m going to stop because it’s not going anywhere.
Well, it was really fun! ftrace is such a powerful tool but it’s made for debugging, not profiling. I was able to get kernel function call graph, get statistics for kernel execution on source code level (can you believe it?), trace some unknown function and all that happened thanks to ftrace. Bless it!
This is how debugfs is mounted: mount -t debugfs none /sys/kernel/debug
↩︎
Sometimes when you're facing a really hard performance problem it's not enough to profile just your application. As we saw while profiling our application with gprof, gcov and Valgrind, the problem is somewhere underneath our application – something is holding pread in long I/O wait cycles.
How to trace a system call is not clear at first sight – there are various kernel profilers, each of which works in its own way and requires its own configuration, methods, analysis and so on. Yes, it's really hard to figure it out. Being the biggest open-source project, developed by a massive community, Linux has absorbed several different and sometimes conflicting profiling facilities. And in some sense it's getting even worse – while some profilers tend to merge (ftrace and perf), other tools emerge – the latest example is ktap.
To understand that bazaar let's start from the bottom – what does the kernel have that makes it possible to profile it? Basically, there are only 3 kernel facilities that enable profiling: kernel tracepoints, kernel probes (kprobes) and perf events counters.
These are the features that give us access to the kernel internals. By using them we can measure kernel functions execution, trace access to devices, analyze CPU states and so on.
These very features are really awkward for direct use and are accessible only from the kernel. Well, if you really want to, you can write your own Linux kernel module that utilizes these facilities for your custom needs, but it's pretty much pointless. That's why people have created a few really good general-purpose profilers. All of them are based on these features and will be discussed later more thoroughly, but for now let's review the features themselves.
Kernel tracepoints are a framework for tracing kernel functions via static instrumentation1.
A tracepoint is a place in the code where you can bind your callback. Tracepoints can be disabled (no callback) or enabled (has a callback). There might be several callbacks, though it's still lightweight – when the callback is disabled it effectively boils down to if (unlikely(tracepoint.enabled)).
Tracepoint output is written to the ring buffer that is exported through debugfs at /sys/kernel/debug/tracing/trace. There is also a whole tree of traceable events at /sys/kernel/debug/tracing/events that exposes control files to enable/disable particular events.
Despite the name, tracepoints are the base for event-based profiling, because besides tracing you can do anything in the callback, e.g. timestamping and measuring resource usage. The Linux kernel has already been instrumented (since 2.6.28) with tracepoints in many places. For example, __do_kmalloc:
/**
* __do_kmalloc - allocate memory
* @size: how many bytes of memory are required.
* @flags: the type of memory to allocate (see kmalloc).
* @caller: function caller for debug tracking of the caller
*/
static __always_inline void *__do_kmalloc(size_t size, gfp_t flags,
unsigned long caller)
{
struct kmem_cache *cachep;
void *ret;
/* If you want to save a few bytes .text space: replace
* __ with kmem_.
* Then kmalloc uses the uninlined functions instead of the inline
* functions.
*/
cachep = kmalloc_slab(size, flags);
if (unlikely(ZERO_OR_NULL_PTR(cachep)))
return cachep;
ret = slab_alloc(cachep, flags, caller);
trace_kmalloc(caller, ret,
size, cachep->size, flags);
return ret;
}
trace_kmalloc is a tracepoint. There are many others in critical parts of the kernel such as the scheduler, block I/O, networking and even interrupt handlers. All of them are used by most profilers because they have minimal overhead, fire on events and save you from modifying the kernel.
Ok, so by now you may be eager to insert tracepoints into all of your kernel modules and profile them to hell, but BEWARE. If you want to add tracepoints you must have a lot of patience and skill, because writing your own tracepoints is really ugly and awkward. You can see examples in samples/trace_events/. Under the hood a tracepoint is C macro black magic that only bold and fearless persons can understand.
And even if you do all those crazy macro declarations and struct definitions, it might simply not work at all if you have CONFIG_MODULE_SIG=y and don't sign the module. It might seem like a strange configuration, but in reality it's the default for all major distributions including Fedora and Ubuntu. That said, after 9 circles of hell, you will end up with nothing.
So, just remember:
USE ONLY EXISTING TRACEPOINTS IN KERNEL, DO NOT CREATE YOUR OWN.
Now I'm gonna explain why this happens. So if you're tired of tracepoints, just skip ahead to the kprobes section.
Ok, so some time ago while preparing kernel 3.12 this code was added:
static int tracepoint_module_coming(struct module *mod)
{
struct tp_module *tp_mod, *iter;
int ret = 0;
/*
* We skip modules that tain the kernel, especially those with different
* module header (for forced load), to make sure we don't cause a crash.
*/
if (mod->taints)
return 0;
If the module is tainted we will NOT get ANY tracepoints. Later it became more adequate:
        /*
         * We skip modules that taint the kernel, especially those with different
         * module headers (for forced load), to make sure we don't cause a crash.
         * Staging and out-of-tree GPL modules are fine.
         */
        if (mod->taints & ~((1 << TAINT_OOT_MODULE) | (1 << TAINT_CRAP)))
                return 0;
Like, ok, it may be out-of-tree (TAINT_OOT_MODULE) or staging (TAINT_CRAP), but anything else is a no-no.
Seems legit, right? Now, what do you think happens if your kernel is compiled with CONFIG_MODULE_SIG enabled and your pretty module is not signed? Well, the module loader will set the TAINT_FORCED_MODULE flag for it. And now your pretty module will never pass the condition in tracepoint_module_coming and will never show you any tracepoint output. And as I said earlier, this stupid option has been the default for all major distributions, including Fedora and Ubuntu, since kernel version 3.1.
If you think – "Well, let's sign the goddamn module!" – you're wrong. Modules must be signed with the kernel private key that is held by your Linux distro vendor and, of course, not available to you.
The whole terrifying story is available in lkml 1, 2.
As for me, I'll just cite my favorite bit from Steven Rostedt (ftrace maintainer and one of the tracepoints developers):
> OK, this IS a major bug and needs to be fixed. This explains a couple
> of reports I received about tracepoints not working, and I never
> figured out why. Basically, they even did this:
>
>
> trace_printk("before tracepoint\n");
> trace_some_trace_point();
> trace_printk("after tracepoint\n");
>
> Enabled the tracepoint (it shows up as enabled and working in the
> tools, but not the trace), but in the trace they would get:
>
> before tracepoint
> after tracepoint
>
> and never get the actual tracepoint. But as they were debugging
> something else, it was just thought that this was their bug. But it
> baffled me to why that tracepoint wasn't working even though nothing in
> the dmesg had any errors about tracepoints.
>
> Well, this now explains it. If you compile a kernel with the following
> options:
>
> CONFIG_MODULE_SIG=y
> # CONFIG_MODULE_SIG_FORCE is not set
> # CONFIG_MODULE_SIG_ALL is not set
>
> You now just disabled (silently) all tracepoints in modules. WITH NO
> FREAKING ERROR MESSAGE!!!
>
> The tracepoints will show up in /sys/kernel/debug/tracing/events, they
> will show up in perf list, you can enable them in either perf or the
> debugfs, but they will never actually be executed. You will just get
> silence even though everything appeared to be working just fine.
Recap: with CONFIG_MODULE_SIG=y (the default in Fedora, Ubuntu and other major distributions) tracepoints in unsigned modules are silently disabled. So use only the tracepoints that already exist in the kernel and do not create your own.
Kernel probes (kprobes) are a dynamic debugging and profiling mechanism that allows you to break into kernel code, invoke your custom function called a probe, and then return everything back.
Basically, it's done by writing a kernel module where you register a handler for some address or symbol in kernel code. Also, according to the definition of struct kprobe, you can pass an offset from the address, but I'm not sure about that. In your registered handler you can do really anything – write to the log or to some buffer exported via sysfs, measure time, and an infinite number of other possibilities. And that's really nifty, in contrast to tracepoints where you can only read logs from debugfs.
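As a rough illustration, here is a minimal sketch modeled on samples/kprobes/kprobe_example.c (not code from this article; the probed symbol is just an example) of what registering a probe looks like:
#include <linux/module.h>
#include <linux/kprobes.h>

/* Runs right before the probed instruction executes. */
static int handler_pre(struct kprobe *p, struct pt_regs *regs)
{
        pr_info("kprobe hit at %p\n", p->addr);
        return 0;
}

static struct kprobe kp = {
        .symbol_name = "do_fork",       /* any non-inlined kernel symbol will do */
        .pre_handler = handler_pre,
};

static int __init kp_init(void)
{
        return register_kprobe(&kp);
}

static void __exit kp_exit(void)
{
        unregister_kprobe(&kp);
}

module_init(kp_init);
module_exit(kp_exit);
MODULE_LICENSE("GPL");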
There are 3 types of probes:
kprobes – basic probes that can break into almost any kernel address;
jprobes – probes for function entry that give you access to the intercepted function's arguments;
kretprobes – probes that fire on function entry and return.
The last 2 types are based on basic kprobes.
All of this generally works like this:
1. You register a probe on some address or symbol.
2. The kprobes subsystem saves the original instruction at that address and replaces it with a breakpoint instruction (int 3 in the case of x86).
3. When execution hits the breakpoint, the trap fires and kprobes gets notified through the notifier_call_chain mechanism.
4. kprobes invokes your handler, then executes the saved original instruction and resumes normal execution.
Our handler usually receives as arguments the address where the breakpoint happened and the register values in a pt_regs structure. The kprobes handler prototype:
typedef int (*kprobe_break_handler_t) (struct kprobe *, struct pt_regs *);
In most cases, except debugging, this info is useless, because we have jprobes. A jprobes handler has exactly the same prototype as the intercepted function. For example, this is the handler for do_fork:
/* Proxy routine having the same arguments as actual do_fork() routine */
static long jdo_fork(unsigned long clone_flags, unsigned long stack_start,
struct pt_regs *regs, unsigned long stack_size,
int __user *parent_tidptr, int __user *child_tidptr)
Also, jprobes don't cause interrupts because they work with the help of setjmp/longjmp, which is much more lightweight.
And finally, the most convenient tool for profiling is kretprobes. It allows you to register 2 handlers – one invoked on function entry and the other on return. But the really cool feature is that it allows you to save state between those 2 calls, like a timestamp or counters.
Instead of a thousand words – look at the absolutely astonishing samples at samples/kprobes.
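For the impatient, here is a minimal sketch modeled on samples/kprobes/kretprobe_example.c (the probed symbol and the printed format are assumptions, not the article's code) that times every do_fork call by keeping a timestamp between the entry and return handlers:
#include <linux/module.h>
#include <linux/kprobes.h>
#include <linux/ptrace.h>
#include <linux/ktime.h>

/* Per-invocation state shared between the entry and return handlers. */
struct fork_data {
        ktime_t entry_time;
};

static int entry_handler(struct kretprobe_instance *ri, struct pt_regs *regs)
{
        struct fork_data *data = (struct fork_data *)ri->data;

        data->entry_time = ktime_get();         /* remember when the function started */
        return 0;
}

static int ret_handler(struct kretprobe_instance *ri, struct pt_regs *regs)
{
        struct fork_data *data = (struct fork_data *)ri->data;
        s64 delta = ktime_to_ns(ktime_sub(ktime_get(), data->entry_time));

        pr_info("do_fork returned %ld and took %lld ns\n",
                (long)regs_return_value(regs), (long long)delta);
        return 0;
}

static struct kretprobe my_kretprobe = {
        .entry_handler  = entry_handler,
        .handler        = ret_handler,
        .data_size      = sizeof(struct fork_data),
        .maxactive      = 20,   /* how many invocations may be tracked in parallel */
        .kp.symbol_name = "do_fork",
};

static int __init krp_init(void)
{
        return register_kretprobe(&my_kretprobe);
}

static void __exit krp_exit(void)
{
        unregister_kretprobe(&my_kretprobe);
}

module_init(krp_init);
module_exit(krp_exit);
MODULE_LICENSE("GPL");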
Recap: kprobes let you hook almost any place in the kernel at runtime by writing a small kernel module; jprobes give you the intercepted function's arguments, and kretprobes let you run handlers on both entry and return while sharing state between them.
perf_events is an interface to hardware metrics implemented in the PMU (Performance Monitoring Unit), which is part of the CPU.
Thanks to perf_events you can easily ask the kernel to show you, say, the L1 cache miss count regardless of what architecture you are on – x86 or ARM. The CPUs supported by perf are listed here.
In addition to that, perf_events includes various kernel metrics like the software context switch count (PERF_COUNT_SW_CONTEXT_SWITCHES). And on top of that, perf_events includes tracepoint support via ftrace.
To access perf_events there is a special syscall, perf_event_open. You pass the type of event (hardware, kernel, tracepoint) and a so-called config, where you specify what exactly you want depending on the type: a tracepoint id in the case of a tracepoint, some CPU metric in the case of hardware, and so on.
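To make the syscall less abstract, here is a minimal sketch adapted from the perf_event_open(2) man page example (the general shape is real; treat the details as illustrative) that counts retired instructions around a single printf:
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <linux/perf_event.h>

/* glibc provides no wrapper for perf_event_open, so call it via syscall(2). */
static long perf_event_open(struct perf_event_attr *attr, pid_t pid,
                            int cpu, int group_fd, unsigned long flags)
{
    return syscall(SYS_perf_event_open, attr, pid, cpu, group_fd, flags);
}

int main(void)
{
    struct perf_event_attr attr;
    long long count;
    int fd;

    memset(&attr, 0, sizeof(attr));
    attr.size = sizeof(attr);
    attr.type = PERF_TYPE_HARDWARE;            /* event type: hardware...        */
    attr.config = PERF_COUNT_HW_INSTRUCTIONS;  /* ...config: retired instructions */
    attr.disabled = 1;
    attr.exclude_kernel = 1;
    attr.exclude_hv = 1;

    fd = perf_event_open(&attr, 0, -1, -1, 0); /* this process, any CPU */
    if (fd == -1) {
        perror("perf_event_open");
        return 1;
    }

    ioctl(fd, PERF_EVENT_IOC_RESET, 0);
    ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);

    printf("Measuring this printf\n");         /* the measured region */

    ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);
    read(fd, &count, sizeof(count));
    printf("Used %lld instructions\n", count);

    close(fd);
    return 0;
}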
On top of that, there is a whole lot of other stuff like event groups, filters, sampling, various output formats and so on. And all of that is constantly breaking3, which is why the only sane way to consume perf_events is the special perf utility – the only userspace utility that is part of the kernel tree.
perf_events and everything related to it is spreading through the kernel like a plague, and now ftrace is going to become part of perf (1, 2). Some people are overreacting about perf-related things, though it's useless because perf is developed by kernel big fish – Ingo Molnar4 and Peter Zijlstra.
I really can't tell anything more about perf_events in isolation from perf, so I'll finish here.
There are a few Linux kernel features that enable profiling:
kernel tracepoints;
kprobes (including jprobes and kretprobes);
perf_events.
All Linux kernel profilers use some combination of these features; read the details in the article for the particular profiler.
Tracepoints are an improvement of an earlier feature called kernel markers. ↩︎
Namely in commit b75ef8b44b1cb95f5a26484b0e2fe37a63b12b44 ↩︎
And that's intended behaviour. The kernel ABI is in no sense stable, the API is. ↩︎
Author of the O(1) process scheduler and of the current default scheduler CFS – the Completely Fair Scheduler. ↩︎
Plus there are unofficial tools not included in Valgrind and distributed as patches.
The biggest plus of Valgrind is that we don't need to recompile or modify our program in any way, because Valgrind tools use emulation as the method of profiling. All of these tools share a common infrastructure that emulates the application runtime – memory management functions, CPU caches, threading primitives, etc. That's where our program executes and gets analyzed by Valgrind.
In the examples below, I'll use my block_hasher program to illustrate the usage of profilers, because it's a small and simple utility.
Now let’s look at what Valgrind can do.
Ok, so Memcheck is a memory error detector – one of the most useful tools in a programmer's toolbox.
Let's launch our hasher under Memcheck:
$ valgrind --leak-check=full ./block_hasher -d /dev/md126 -b 1048576 -t 10 -n 1000
==4323== Memcheck, a memory error detector
==4323== Copyright (C) 2002-2010, and GNU GPL'd, by Julian Seward et al.
==4323== Using Valgrind-3.6.0 and LibVEX; rerun with -h for copyright info
==4323== Command: ./block_hasher -d /dev/md126 -b 1048576 -t 10 -n 1000
==4323==
==4323==
==4323== HEAP SUMMARY:
==4323== in use at exit: 16 bytes in 1 blocks
==4323== total heap usage: 43 allocs, 42 frees, 10,491,624 bytes allocated
==4323==
==4323== LEAK SUMMARY:
==4323== definitely lost: 0 bytes in 0 blocks
==4323== indirectly lost: 0 bytes in 0 blocks
==4323== possibly lost: 0 bytes in 0 blocks
==4323== still reachable: 16 bytes in 1 blocks
==4323== suppressed: 0 bytes in 0 blocks
==4323== Reachable blocks (those to which a pointer was found) are not shown.
==4323== To see them, rerun with: --leak-check=full --show-reachable=yes
==4323==
==4323== For counts of detected and suppressed errors, rerun with: -v
==4323== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 6 from 6)
I won't explain what definitely lost, indirectly lost and the others mean – that's what the documentation is for.
From the Memcheck profile we can say that there are no errors except a little leak: 1 block is still reachable. From the message
total heap usage: 43 allocs, 42 frees, 10,491,624 bytes allocated
it's clear I have forgotten to call free somewhere. And that's true: in bdev_open I'm allocating a struct for block_device, but in bdev_close it's not freed.
By the way, it's interesting that Memcheck reports a 16-byte loss, while block_device holds an int and an off_t, which should occupy 4 + 8 = 12 bytes. Where are the 4 extra bytes? Structs are 8-byte aligned (on a 64-bit system), so the int field is padded with 4 bytes.
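To see that padding, here is a minimal sketch with a hypothetical struct that mirrors that layout (an int followed by an off_t); the names are made up, only the sizes matter:
#include <stdio.h>
#include <sys/types.h>

/* Hypothetical struct with the same field types as the one in block_hasher. */
struct bdev_like {
    int   fd;   /* 4 bytes, then 4 bytes of padding ...       */
    off_t off;  /* ... so this 8-byte field is 8-byte aligned  */
};

int main(void)
{
    /* Prints 16 on a typical 64-bit system, not 12. */
    printf("sizeof(struct bdev_like) = %zu\n", sizeof(struct bdev_like));
    return 0;
}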
Anyway, I've fixed the memory leak:
@@ -240,6 +241,9 @@ void bdev_close( struct block_device *dev )
perror("close");
}
+ free(dev);
+ dev = NULL;
+
return;
}
Check:
$ valgrind --leak-check=full ./block_hasher -d /dev/md126 -b 1048576 -t 10 -n 1000
==15178== Memcheck, a memory error detector
==15178== Copyright (C) 2002-2010, and GNU GPL'd, by Julian Seward et al.
==15178== Using Valgrind-3.6.0 and LibVEX; rerun with -h for copyright info
==15178== Command: ./block_hasher -d /dev/md0 -b 1048576 -t 10 -n 1000
==15178==
==15178==
==15178== HEAP SUMMARY:
==15178== in use at exit: 0 bytes in 0 blocks
==15178== total heap usage: 43 allocs, 43 frees, 10,491,624 bytes allocated
==15178==
==15178== All heap blocks were freed -- no leaks are possible
==15178==
==15178== For counts of detected and suppressed errors, rerun with: -v
==15178== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 6 from 6)
A real pleasure to see.
To sum up, I'd like to say that Memcheck can do a lot – not only detecting memory errors, but also explaining them. It's not enough to say "Hey, you've got some error here!" – to fix the error it's better to know the reason, and Memcheck gives it to you. It's so good that it's even listed as a skill in system programmer job postings.
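As an illustration (a hypothetical snippet, not from block_hasher), for a use-after-free like the one below Memcheck reports not just "Invalid write of size 1" but also the stack where the block was freed and the stack where it was originally allocated:
#include <stdlib.h>

int main(void)
{
    char *buf = malloc(16);

    free(buf);      /* Memcheck remembers this free's stack trace...        */
    buf[0] = 'x';   /* ...and reports it here, along with the alloc stack,  */
                    /* when it flags the invalid write.                     */
    return 0;
}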
Cachegrind is a CPU cache access profiler. What amazed me is how it traces cache accesses – Cachegrind simulates them. See an excerpt from the documentation:
It performs detailed simulation of the I1, D1 and L2 caches in your CPU and so can accurately pinpoint the sources of cache misses in your code.
If you think it's easy, please spend 90 minutes reading this great article.
Let's collect a profile!
$ valgrind --tool=cachegrind ./block_hasher -d /dev/md126 -b 1048576 -t 10 -n 1000
==9408== Cachegrind, a cache and branch-prediction profiler
==9408== Copyright (C) 2002-2010, and GNU GPL'd, by Nicholas Nethercote et al.
==9408== Using Valgrind-3.6.0 and LibVEX; rerun with -h for copyright info
==9408== Command: ./block_hasher -d /dev/md126 -b 1048576 -t 10 -n 1000
==9408==
--9408-- warning: Unknown Intel cache config value (0xff), ignoring
--9408-- warning: L2 cache not installed, ignore LL results.
==9408==
==9408== I refs: 167,774,548,454
==9408== I1 misses: 1,482
==9408== LLi misses: 1,479
==9408== I1 miss rate: 0.00%
==9408== LLi miss rate: 0.00%
==9408==
==9408== D refs: 19,989,520,856 (15,893,212,838 rd + 4,096,308,018 wr)
==9408== D1 misses: 163,354,097 ( 163,350,059 rd + 4,038 wr)
==9408== LLd misses: 74,749,207 ( 74,745,179 rd + 4,028 wr)
==9408== D1 miss rate: 0.8% ( 1.0% + 0.0% )
==9408== LLd miss rate: 0.3% ( 0.4% + 0.0% )
==9408==
==9408== LL refs: 163,355,579 ( 163,351,541 rd + 4,038 wr)
==9408== LL misses: 74,750,686 ( 74,746,658 rd + 4,028 wr)
==9408== LL miss rate: 0.0% ( 0.0% + 0.0% )
The first thing I look at is cache misses. But here it's less than 1%, so it can't be the problem.
If you're asking yourself how Cachegrind can be useful, I'll tell you one of my work stories. To accelerate some RAID calculation algorithm, a colleague of mine reduced the number of multiplications at the price of more additions and a more complicated data structure. In theory it should have worked better, as in Karatsuba multiplication. But in reality it became much worse. After a few days of hard debugging, we launched it under Cachegrind and saw a cache miss rate of about 80%. The extra additions caused more memory accesses and reduced locality. So we abandoned the idea.
IMHO Cachegrind is not that useful anymore since the advent of perf, which does actual cache profiling using the CPU's PMU (performance monitoring unit), so perf is more precise and has much lower overhead.
Massif is a heap profiler, in the sense that it shows the dynamics of heap allocations, i.e. how much memory your application was using at any given moment.
To do that, Massif samples the heap state, generating a file that is later transformed into a report with the ms_print tool.
Ok, let's launch it:
$ valgrind --tool=massif ./block_hasher -d /dev/md0 -b 1048576 -t 10 -n 100
==29856== Massif, a heap profiler
==29856== Copyright (C) 2003-2010, and GNU GPL'd, by Nicholas Nethercote
==29856== Using Valgrind-3.6.0 and LibVEX; rerun with -h for copyright info
==29856== Command: ./block_hasher -d /dev/md0 -b 1048576 -t 10 -n 100
==29856==
==29856==
Got a massif.out.29856 file. Convert it to text profile:
$ ms_print massif.out.29856 > massif.profile
The profile contains a histogram of heap allocations
MB
10.01^::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::#
|: #
|@ #::
|@ # :
|@ # ::
|@ # ::
|@ # ::@
|@ # ::@
|@ # ::@
|@ # ::@
|@ # ::@
|@ # ::@
|@ # ::@@
|@ # ::@@
|@ # ::@@
|@ # ::@@
|@ # ::@@
|@ # ::@@
|@ # ::@@
|@ # ::@@
0 +----------------------------------------------------------------------->Gi
0 15.63
and a summary table of most notable allocations.
Example:
--------------------------------------------------------------------------------
n time(i) total(B) useful-heap(B) extra-heap(B) stacks(B)
--------------------------------------------------------------------------------
40 344,706 9,443,296 9,442,896 400 0
41 346,448 10,491,880 10,491,472 408 0
42 346,527 10,491,936 10,491,520 416 0
43 346,723 10,492,056 10,491,624 432 0
44 15,509,791,074 10,492,056 10,491,624 432 0
100.00% (10,491,624B) (heap allocation functions) malloc/new/new[], --alloc-fns, etc.
->99.94% (10,485,760B) 0x401169: thread_func (block_hasher.c:142)
| ->99.94% (10,485,760B) 0x54189CF: start_thread (in /lib64/libpthread-2.12.so)
| ->09.99% (1,048,576B) 0x6BDC6FE: ???
| |
| ->09.99% (1,048,576B) 0x7FDE6FE: ???
| |
| ->09.99% (1,048,576B) 0x75DD6FE: ???
| |
| ->09.99% (1,048,576B) 0x93E06FE: ???
| |
| ->09.99% (1,048,576B) 0x89DF6FE: ???
| |
| ->09.99% (1,048,576B) 0xA1E16FE: ???
| |
| ->09.99% (1,048,576B) 0xABE26FE: ???
| |
| ->09.99% (1,048,576B) 0xB9E36FE: ???
| |
| ->09.99% (1,048,576B) 0xC3E46FE: ???
| |
| ->09.99% (1,048,576B) 0xCDE56FE: ???
|
->00.06% (5,864B) in 1+ places, all below ms_print's threshold (01.00%)
In the table above we can see that we usually allocate in 10 MiB portions, which are really just 10 blocks of 1 MiB (our block size). Nothing special, but it was interesting.
Of course, Massif is useful: it can show you the history of allocations, how much memory was allocated including all the alignment, and also which pieces of code allocated the most. Too bad I don't have any heap errors.
Helgrind is not a profiler but a tool to detect threading errors – it's a thread debugger.
I'll just show how I fixed a bug in my code with Helgrind's help.
When I launched my block_hasher under it, I was sure I would get 0 errors, but instead I got stuck debugging for a couple of days.
$ valgrind --tool=helgrind ./block_hasher -d /dev/md0 -b 1048576 -t 10 -n 100
==3930== Helgrind, a thread error detector
==3930== Copyright (C) 2007-2010, and GNU GPL'd, by OpenWorks LLP et al.
==3930== Using Valgrind-3.6.0 and LibVEX; rerun with -h for copyright info
==3930== Command: ./block_hasher -d /dev/md0 -b 1048576 -t 10 -n 100
==3930==
==3930== Thread #3 was created
==3930== at 0x571DB2E: clone (in /lib64/libc-2.12.so)
==3930== by 0x541E8BF: do_clone.clone.0 (in /lib64/libpthread-2.12.so)
==3930== by 0x541EDA1: pthread_create@@GLIBC_2.2.5 (in /lib64/libpthread-2.12.so)
==3930== by 0x4C2CE76: pthread_create_WRK (hg_intercepts.c:257)
==3930== by 0x4019F0: main (block_hasher.c:350)
==3930==
==3930== Thread #2 was created
==3930== at 0x571DB2E: clone (in /lib64/libc-2.12.so)
==3930== by 0x541E8BF: do_clone.clone.0 (in /lib64/libpthread-2.12.so)
==3930== by 0x541EDA1: pthread_create@@GLIBC_2.2.5 (in /lib64/libpthread-2.12.so)
==3930== by 0x4C2CE76: pthread_create_WRK (hg_intercepts.c:257)
==3930== by 0x4019F0: main (block_hasher.c:350)
==3930==
==3930== Possible data race during write of size 4 at 0x5200380 by thread #3
==3930== at 0x4E98AF8: CRYPTO_malloc (in /usr/lib64/libcrypto.so.1.0.1e)
==3930== by 0x4F16FF6: EVP_MD_CTX_create (in /usr/lib64/libcrypto.so.1.0.1e)
==3930== by 0x401231: thread_func (block_hasher.c:163)
==3930== by 0x4C2D01D: mythread_wrapper (hg_intercepts.c:221)
==3930== by 0x541F9D0: start_thread (in /lib64/libpthread-2.12.so)
==3930== by 0x75E46FF: ???
==3930== This conflicts with a previous write of size 4 by thread #2
==3930== at 0x4E98AF8: CRYPTO_malloc (in /usr/lib64/libcrypto.so.1.0.1e)
==3930== by 0x4F16FF6: EVP_MD_CTX_create (in /usr/lib64/libcrypto.so.1.0.1e)
==3930== by 0x401231: thread_func (block_hasher.c:163)
==3930== by 0x4C2D01D: mythread_wrapper (hg_intercepts.c:221)
==3930== by 0x541F9D0: start_thread (in /lib64/libpthread-2.12.so)
==3930== by 0x6BE36FF: ???
==3930==
==3930==
==3930== For counts of detected and suppressed errors, rerun with: -v
==3930== Use --history-level=approx or =none to gain increased speed, at
==3930== the cost of reduced accuracy of conflicting-access information
==3930== ERROR SUMMARY: 9 errors from 1 contexts (suppressed: 955 from 9)
As we can see, EVP_MD_CTX_create leads to a data race. This is an OpenSSL1 function that initializes a context for hash calculation. I calculate the hash for the blocks read in each thread with EVP_DigestUpdate and then write it to a file after the final EVP_DigestFinal_ex. So these Helgrind errors are related to OpenSSL functions. And I asked myself – "Is libcrypto thread-safe?". I used my google-fu and the answer is – by default, no. To use EVP functions in multithreaded applications, OpenSSL recommends either registering 2 crazy callbacks or using dynamic locks (see here).
As for me, I just wrapped the context initialization in a pthread mutex and that's it.
@@ -159,9 +159,11 @@ void *thread_func(void *arg)
gap = num_threads * block_size; // Multiply here to avoid integer overflow
// Initialize EVP and start reading
+ pthread_mutex_lock( &mutex );
md = EVP_sha1();
mdctx = EVP_MD_CTX_create();
EVP_DigestInit_ex( mdctx, md, NULL );
+ pthread_mutex_unlock( &mutex );
If anyone knows something about this – please, tell me.
DRD is one more tool in the Valgrind suite that can detect threading errors. It's more thorough and has more features, while being less memory hungry.
In my case it detected some mysterious pread data race.
==16358== Thread 3:
==16358== Conflicting load by thread 3 at 0x0563e398 size 4
==16358== at 0x5431030: pread (in /lib64/libpthread-2.12.so)
==16358== by 0x4012D9: thread_func (block_hasher.c:174)
==16358== by 0x4C33470: vgDrd_thread_wrapper (drd_pthread_intercepts.c:281)
==16358== by 0x54299D0: start_thread (in /lib64/libpthread-2.12.so)
==16358== by 0x75EE6FF: ???
pread itself is thread-safe in the sense that it can be called from multiple threads, but access to the data might not be synchronized. For example, you can call pread in one thread while calling pwrite on the same range in another, and that's where you get a data race. But in my case the data blocks do not overlap, so I can't tell what the real problem is here.
The conclusion will be dead simple – learn how to use Valgrind, it’s extremely useful.
libcrypto is a library of cryptographic functions and primitives that OpenSSL is based on. ↩︎
In the examples below, I'll use my block_hasher program to illustrate the usage of profilers, because it's a small and simple utility.
gprof (GNU Profiler) is a simple and easy profiler that can show how much time your program spends in its routines, in percents and seconds. gprof uses source code instrumentation, inserting a call to a special mcount function to gather metrics about your program.
To gather a profile you need to compile your program with the -pg gcc option and then run it to produce data for gprof. For better results and to eliminate statistical errors, it's recommended to run the profiled program several times.
To build with gprof instrumentation invoke gcc like this:
$ gcc <your options> -pg -g prog.c -o prog
Here are the actual compile instructions for block_hasher:
$ gcc -lrt -pthread -lcrypto -pg -g block_hasher.c -o block_hasher
As a result, you'll get an instrumented program. To check whether it's really instrumented, just grep for the mcount symbol:
$ nm block_hasher | grep mcount
U mcount@@GLIBC_2.2.5
As I said earlier, to collect useful statistics we should run the program several times and accumulate the metrics. To do that I've written a simple bash script:
#!/bin/bash

if [[ $# -lt 2 ]]; then
    echo "$0 <number of runs> <program with options...>"
    exit 1
fi

RUNS=$1
shift 1
COMMAND="$@"

# Profile name is a program name (first element in args)
PROFILE_NAME="$(echo "${COMMAND}" | cut -f1 -d' ')"

for i in $(seq 1 ${RUNS}); do
    # Run profiled program
    eval "${COMMAND}"

    # Accumulate gprof statistic
    if [[ -e gmon.sum ]]; then
        gprof -s ${PROFILE_NAME} gmon.out gmon.sum
    else
        mv gmon.out gmon.sum
    fi
done

# Make final profile
gprof ${PROFILE_NAME} gmon.sum > gmon.profile
So, each launch will create a gmon.out that gprof will merge into gmon.sum. Finally, gmon.sum is fed to gprof to produce a flat text profile and a call graph.
Let’s do this for our program:
$ ./gprofiler.sh 10 ./block_hasher -d /dev/sdd -b 1048576 -t 10 -n 1000
After it finishes, this script will create gmon.profile – a text profile that we can analyze.
The flat profile has info about the program's routines and the time spent in them.
Flat profile:
Each sample counts as 0.01 seconds.
% cumulative self self total
time seconds seconds calls Ts/call Ts/call name
100.24 0.01 0.01 thread_func
0.00 0.01 0.00 50 0.00 0.00 time_diff
0.00 0.01 0.00 5 0.00 0.00 bdev_close
0.00 0.01 0.00 5 0.00 0.00 bdev_open
The gprof metrics are clear from their names. As we can see, our little program spends almost all of its time in the thread function, BUT look at the actual seconds – only 0.01 seconds of the whole program execution. It means that it's not the thread function that is slowing things down but something underneath it. In the case of block_hasher, it's the pread syscall that does the I/O for the block device.
The call graph is really not interesting here, so I won't show it to you, sorry.
gcov (short for GNU Coverage) is a tool to collect function call statistics line by line. Usually it's used in tandem with gprof to understand which exact line in a slacking function is holding your program back.
Just as with gprof, you need to recompile your program with the gcov flags:
# gcc -fprofile-arcs -ftest-coverage -lcrypto -pthread -lrt -Wall -Wextra block_hasher.c -o block_hasher
There are 2 gcov flags: -fprofile-arcs and -ftest-coverage. The first instruments your program to profile so-called arcs – paths in the program's control flow. The second makes gcc collect additional notes for arc profiling and for gcov itself.
You can simply pass the --coverage option, which implies both -fprofile-arcs and -ftest-coverage at compile time and the -lgcov flag at link time. See the GCC debugging options for more info.
Now, after instrumenting, we just launch the program and end up with 2 files – block_hasher.gcda and block_hasher.gcno. Please don't look inside them – we will transform them into a text profile. To do this we run gcov, passing it the source file name. It's important that you have the <filename>.gcda and <filename>.gcno files next to it.
$ gcov block_hasher.c
File 'block_hasher.c'
Lines executed:77.69% of 121
block_hasher.c:creating 'block_hasher.c.gcov'
Finally, we'll get block_hasher.c.gcov.
The .gcov file is the result of all that gcov work. Let's look at it. For each of your source files gcov creates annotated source code with runtime coverage. Here is an excerpt from thread_func:
10: 159: gap = num_threads * block_size; // Multiply here to avoid integer overflow
-: 160:
-: 161: // Initialize EVP and start reading
10: 162: md = EVP_sha1();
10: 163: mdctx = EVP_MD_CTX_create();
10: 164: EVP_DigestInit_ex( mdctx, md, NULL );
-: 165:
10: 166: get_clock( &start );
10010: 167: for( i = 0; i < nblocks; i++)
-: 168: {
10000: 169: offset = j->off + gap * i;
-: 170:
-: 171: // Read at offset without changing file pointer
10000: 172: err = pread( bdev->fd, buf, block_size, offset );
9999: 173: if( err == -1 )
-: 174: {
#####: 175: fprintf(stderr, "T%02d Failed to read at %llu\n", j->num, (unsigned long long)offset);
#####: 176: perror("pread");
#####: 177: pthread_exit(NULL);
-: 178: }
-: 179:
9999: 180: bytes += err; // On success pread returns bytes read
-: 181:
-: 182: // Update digest
9999: 183: EVP_DigestUpdate( mdctx, buf, block_size );
-: 184: }
10: 185: get_clock( &end );
10: 186: sec_diff = time_diff( start, end );
-: 187:
10: 188: EVP_DigestFinal_ex( mdctx, j->digest, &j->digest_len );
10: 189: EVP_MD_CTX_destroy(mdctx);
The leftmost column is how many times that line of code was executed. As expected, the for loop body was executed 10000 times – 10 threads each reading 1000 blocks. Nothing new.
Though gcov was not that useful for me, I'd like to say that it has a really cool feature – branch probabilities. If you launch gcov with the -b option
[root@simplex block_hasher]# gcov -b block_hasher.c
File 'block_hasher.c'
Lines executed:77.69% of 121
Branches executed:100.00% of 66
Taken at least once:60.61% of 66
Calls executed:51.47% of 68
block_hasher.c:creating 'block_hasher.c.gcov'
you'll get a nice branch annotation with probabilities. For example, here is the time_diff function:
function time_diff called 10 returned 100% blocks executed 100%
       10:  106:double time_diff(struct timespec start, struct timespec end)
        -:  107:{
        -:  108:    struct timespec diff;
        -:  109:    double sec;
        -:  110:
       10:  111:    if ( (end.tv_nsec - start.tv_nsec) < 0 )
branch  0 taken 60% (fallthrough)
branch  1 taken 40%
        -:  112:    {
        6:  113:        diff.tv_sec = end.tv_sec - start.tv_sec - 1;
        6:  114:        diff.tv_nsec = 1000000000 + end.tv_nsec - start.tv_nsec;
        -:  115:    }
        -:  116:    else
        -:  117:    {
        4:  118:        diff.tv_sec = end.tv_sec - start.tv_sec;
        4:  119:        diff.tv_nsec = end.tv_nsec - start.tv_nsec;
        -:  120:    }
        -:  121:
       10:  122:    sec = (double)diff.tv_nsec / 1000000000 + diff.tv_sec;
        -:  123:
       10:  124:    return sec;
        -:  125:}
In 60% of the if invocations we fell into the branch that computes the time diff with a borrow, i.e. the nanosecond part of the end timestamp was smaller than that of the start timestamp.
gprof and gcov are really entertaining tools, despite a lot of people considering them obsolete. On the one hand, these utilities are simple; they implement and automate an obvious method – source code instrumentation – to measure function hit counts.
On the other hand, such simple metrics won't help with problems outside your application, like the kernel or libraries, although there are ways to apply them to an operating system, e.g. the Linux kernel. Anyway, gprof and gcov are useless when your application spends most of its time waiting on some syscall (pread in my case).
Profiling is dynamic analysis of software that consists of gathering various metrics and calculating some statistical info from them. Usually you do profiling to analyze performance, though that's not the only use case; e.g. there is work on profiling for energy consumption analysis.
Do not confuse profiling with tracing. Tracing is the procedure of recording a program's runtime steps to debug it – you are not gathering any metrics.
Also, don't confuse profiling with benchmarking. Benchmarking is all about marketing: you launch some predefined procedure to get a couple of numbers that you can print in your marketing brochures.
A profiler is a program that does profiling.
A profile is the result of profiling – some statistical info calculated from the gathered metrics.
There are a lot of metrics that a profiler can gather and analyze, and I won't list them all, but roughly they fall into a hierarchy: hardware metrics coming from the CPU (cache misses, branch mispredictions), kernel metrics (system calls, context switches, I/O) and application-level metrics (time spent in functions, call counts).
The variety of metrics implies a variety of methods to gather them. And I have a beautiful hierarchy for that, yeah: invasive methods that modify the code (source code instrumentation, binary instrumentation) and non-invasive ones that observe it from the outside (sampling, event-based profiling, emulation).
(That's all the methods I know. If you come up with another – feel free to contact me.)
A quick review of methods.
Source code instrumentation is the simplest one. If you have the source code, you can add special profiling calls to every function (not manually, of course) and then launch your program. The profiling calls will trace the function graph and can also compute time spent in functions, branch probabilities and a lot of other things. But oftentimes you don't have the source code. And that makes me a saaaaad panda.
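As an illustration of the idea (gprof itself inserts mcount calls; this minimal sketch uses gcc's -finstrument-functions hooks instead, which is a different but related mechanism), the compiler can be asked to call your code on every function entry and exit:
#include <stdio.h>

/* Called by compiler-inserted code at every function entry.
 * The attribute keeps the hooks themselves from being instrumented. */
__attribute__((no_instrument_function))
void __cyg_profile_func_enter(void *func, void *caller)
{
    fprintf(stderr, "enter %p (from %p)\n", func, caller);
}

__attribute__((no_instrument_function))
void __cyg_profile_func_exit(void *func, void *caller)
{
    fprintf(stderr, "exit  %p\n", func);
}

static int work(int x)
{
    return x * x;
}

int main(void)
{
    /* Build with: gcc -finstrument-functions example.c -o example */
    printf("%d\n", work(7));
    return 0;
}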
Binary instrumentation is what you can guess by yourself – you modify the program's binary image, either on disk (program.exe) or in memory. This is what reverse engineers love to do: to research some critical commercial software or analyze malware, they do binary instrumentation and analyze the program's behavior.
Anyway, binary instrumentation is also really useful in profiling – many modern instruments are built on top of binary instrumentation ideas (SystemTap, ktap, DTrace).
Ok, so sometimes you can't instrument even the binary code, e.g. when you're profiling the OS kernel, or some pretty complicated system consisting of many tightly coupled modules that won't work after instrumentation. That's why we have non-invasive profiling.
Sampling is the first natural idea you come up with when you can't modify any code. The point is that the profiler periodically inspects the CPU registers (e.g. the PSW) and analyzes what is going on. By the way, this is also the only reasonable way to get hardware metrics – by periodically polling the PMU (performance monitoring unit).
Event-based profiling is about gathering events that must somehow be prepared/preinstalled by the vendor of the profiled subject. Examples are inotify, kernel tracepoints in Linux, and VTune events.
And finally, emulation is just running your program in an isolated environment like a virtual machine or QEMU, giving you full control over program execution but distorting its behavior.
– Hey, uhmm, could you help me with some strange thing?
– Yeah, sure, what's the matter?
– I have data corruption and it’s happening in a really crazy manner.
If you don't know, data/memory corruption is the single nastiest and most awful bug that can happen in your program. Especially when you are a storage developer.
So here was the case. We have a RAID calculation algorithm. Nothing fancy – just a bunch of functions that get a pointer to a buffer, do some math over that buffer and return. Initially, the calculation algorithm was written in userspace for easier debugging, correctness proofs and profiling, and then ported to kernel space. And that's where the problems started.
Firstly, when building with kbuild, gcc was just crashing1, eating all the available memory. I was not that surprised considering the file sizes – a dozen files, each about 10 megabytes. Yes, 10 MB of C sources. And even that was not surprising, because those sources were generated from assembly and were actually a bunch of intrinsics. Anyway, it would have been much better if gcc didn't just crash.
So we just wrote a separate Makefile to build the object files that are later linked into the kernel module.
Secondly, the data was not corrupted every time. When you read 1 GB from the disks it was fine. When you read 2 GB, sometimes it was ok and sometimes not.
Thorough source code reading led to nothing. We saw that the memory buffer was corrupted exactly in the calculation functions. But those functions were pure math: just computation with no side effects – they didn't call any library functions, and they didn't change anything except the passed buffer and local variables. And their changes to the buffer were correct, while the corruption was real – the calc functions simply cannot generate such data.
And then we saw pure magic. If we added a single
printk("");
to a calc function, the data was not corrupted at all. I thought such things were the subject of DailyWTF stories or developer jokes. We checked everything several times on different hosts – the data was correct. Well, there was nothing left for us except disassembling the object files to determine what was so special about printk.
So we did a diff between the 2 object files, with and without printk.
--- Calculation.s 2014-01-27 15:52:11.581387291 +0300
+++ Calculation_printk.s 2014-01-27 15:51:50.109512524 +0300
@@ -1,10 +1,15 @@
.file "Calculation.c"
+ .section .rodata.str1.1,"aMS",@progbits,1
+.LC0:
+ .string ""
.text
.p2align 4,,15
.globl Calculation_5d
.type Calculation_5d, @function
Calculation_5d:
.LFB20:
+ subq $24, %rsp
+.LCFI0:
movq (%rdi), %rax
movslq %ecx, %rcx
movdqa (%rax,%rcx), %xmm4
@@ -46,7 +51,7 @@
pxor %xmm2, %xmm6
movdqa 96(%rax,%rcx), %xmm2
pxor %xmm5, %xmm1
- movdqa %xmm14, -24(%rsp)
+ movdqa %xmm14, (%rsp)
pxor %xmm15, %xmm2
pxor %xmm5, %xmm0
movdqa 112(%rax,%rcx), %xmm14
@@ -108,11 +113,16 @@
movq 24(%rdi), %rax
movdqa %xmm6, 80(%rax,%rcx)
movq 24(%rdi), %rax
- movdqa -24(%rsp), %xmm0
+ movdqa (%rsp), %xmm0
movdqa %xmm0, 96(%rax,%rcx)
movq 24(%rdi), %rax
+ movl $.LC0, %edi
movdqa %xmm14, 112(%rax,%rcx)
+ xorl %eax, %eax
+ call printk
movl $128, %eax
+ addq $24, %rsp
+.LCFI1:
ret
.LFE20:
.size Calculation_5d, .-Calculation_5d
@@ -143,6 +153,14 @@
.long .LFB20
.long .LFE20-.LFB20
.uleb128 0x0
+ .byte 0x4
+ .long .LCFI0-.LFB20
+ .byte 0xe
+ .uleb128 0x20
+ .byte 0x4
+ .long .LCFI1-.LCFI0
+ .byte 0xe
+ .uleb128 0x8
.align 8
.LEFDE1:
.ident "GCC: (GNU) 4.4.5 20110214 (Red Hat 4.4.5-6)"
Ok, looks like nothing changed much: a string declaration in the .rodata section, a call to printk at the end. But what looked really strange to me were the changes in the %rsp manipulations. It seems they are doing the same thing, but in the printk version they are shifted by 24 bytes, because at the start it does subq $24, %rsp.
We didn't care much about it at first. On the x86 architecture the stack grows down, i.e. towards smaller addresses. So to access local variables (which live on the stack) you create a new stack frame by saving the current %rsp in %rbp and shifting %rsp, thus allocating space on the stack. This is called the function prologue, and it was absent in our assembly in the function without printk.
You need this stack manipulation to later access your local vars at offsets from %rbp. But here we were using negative offsets from %rsp – isn't that strange?
Wait a minute… I decided to draw the stack frame and got it!
Holy shucks! We are accessing undefined memory. All instructions like
movdqa -24(%rsp), %xmm0
which move aligned data from address %rsp-24 into %xmm0, are actually accesses over the top of the stack!
WHY?
I was really shocked. So shocked that I even asked on Stack Overflow. And the answer was the red zone.
In short, the red zone is a 128-byte piece of memory over the stack top (below %rsp) that, according to the amd64 ABI, must not be touched by interrupt or signal handlers, so a leaf function may freely use it for its locals. And that is rock-solid truth – but only in userspace. When you are in kernel space, abandon hope of extra memory – the stack is worth its weight in gold there. And you get a whole lot of interrupt handling there, too.
When an interrupt occurs, the interrupt handler uses the stack frame of the current kernel thread, but to avoid corrupting the thread's data it keeps its own data over the stack top. And since our code was compiled with red zone support, our thread's data was located over the stack top in exactly the same place as the interrupt handler's data.
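To see the red zone with your own eyes, here is a minimal sketch (a hypothetical leaf function, nothing to do with the original RAID code): compile it with gcc -O1 -S and again with gcc -O1 -S -mno-red-zone and compare how the spilled locals are addressed.
/* Hypothetical leaf function: locals big enough to live on the stack and no
 * calls, so the compiler may keep tmp in the 128-byte red zone below %rsp
 * instead of emitting a subq to allocate a frame; with -mno-red-zone it
 * must adjust %rsp first. */
void leaf(const long *in, long *out, int n)
{
    long tmp[8];
    int i;

    for (i = 0; i < 8; i++)
        tmp[i] = in[i] ^ in[(i + n) & 7];
    for (i = 0; i < 8; i++)
        out[i] = tmp[i] + tmp[7 - i];
}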
That's why the kernel is compiled with the -mno-red-zone gcc flag. It's set implicitly by kbuild2.
But remember that we were not able to build with kbuild because it crashed every time due to the huge files. Anyway, we just added EXTRA_CFLAGS += -mno-red-zone to our Makefile and it works now.
Still, I had a question: why does adding printk("") prevent the use of the red zone and make gcc allocate space for local variables with subq $24, %rsp? Recently, in 2020, a kind person reached out to me and explained it: printk("") makes the calc function non-leaf – it now calls another function that can't be inlined – so the compiler is no longer allowed to keep its locals in the red zone. Kudos to Chris Pearson for sharing this with me after 6 years!
So that day I learned about a really tricky optimization that, at the cost of potential memory corruption in the kernel, can save you a couple of instructions in every leaf function.
That’s all, folks!